Fastest way to iterate permutations with value constraints - Python

I have an array of dicts, and I need every combination of them, without duplicates, subject to two constraints: no repeated id value within a combination, and the ratio values must sum to 1.0.
So the results would be:
results = [
    [
        {'id': 1, 'ratio': .01},
        {'id': 2, 'ratio': .99},
    ],
    [
        {'id': 1, 'ratio': .50},
        {'id': 2, 'ratio': .50},
    ],
    [ ... ],
    [ ... ],
]
For example:
_array_list = [
    {'id': 1, 'ratio': .01},
    {'id': 1, 'ratio': .02},
    ....
    {'id': 2, 'ratio': .01},
    {'id': 3, 'ratio': .02},
    ...
]
Each id has every ratio from 0.01 to 1.0 in steps of 0.01.
I then do the following to get each possible combination
(there is a reason for this, but I am leaving out the parts that have nothing to do with the issue):
from itertools import combinations

unique_list_count = 2  # (this is the number of distinct ids)
all_combos = []
iter_count = 0    # combinations kept
_iter_count = 0   # combinations examined

for i in range(1, unique_list_count + 1):
    for combo in combinations(_array_list, i):
        _iter_count += 1
        ids = [c['id'] for c in combo]
        is_id_duplicate = len(ids) != len(set(ids))
        if not is_id_duplicate:
            # make sure only full (sum == 1.0) allocations are appended
            if sum(v['ratio'] for v in combo) == 1.0:
                iter_count += 1
                print(iter_count, _iter_count)
                all_combos.append(list(combo))
I'm not sure if this is a good way or if I can even make this better, but it works. The issue is that when I have 5 IDs, each with 100 dictionaries, it does about 600,000,000 combinations and takes about 20 minutes.
Is there a more efficient and faster way to do this?

You could use the below code. The advantage of using it is that it won't consider cases with repeating ids:
import itertools
from math import isclose

def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return itertools.chain.from_iterable(itertools.combinations(s, r) for r in range(len(s)+1))

def combosSumToOneById(inDictArray):
    results = []
    uniqueIds = {d['id'] for d in inDictArray}
    valuesDict = {id: [d['ratio'] for d in inDictArray if d['id'] == id] for id in uniqueIds}
    for idCombo in powerset(uniqueIds):
        for valueCombo in itertools.product(*[v for k, v in valuesDict.items() if k in idCombo]):
            if isclose(sum(valueCombo), 1.0):
                results.append([{'id': xid, 'ratio': xv} for xid, xv in zip(idCombo, valueCombo)])
    return results
I tested it on the below input
_array_list = [
    {'id': '1', 'ratio': .1},
    {'id': '1', 'ratio': .2},
    {'id': '2', 'ratio': .9},
    {'id': '2', 'ratio': .8},
    {'id': '3', 'ratio': .8},
]
combosSumToOneById(_array_list)
Returns: [[{'id': '1', 'ratio': 0.1}, {'id': '2', 'ratio': 0.9}], [{'id': '1', 'ratio': 0.2}, {'id': '2', 'ratio': 0.8}], [{'id': '1', 'ratio': 0.2}, {'id': '3', 'ratio': 0.8}]]
You should test whether its performance really exceeds that of your previous approach.
Please note that I modified the code to check isclose(sum, 1.0) rather than sum == 1. Since we are summing floating-point values, there will most likely be some error from the representation of the numbers, which is why using this condition seems more appropriate.
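To see why the exact comparison can drop valid combinations, here is a tiny standalone illustration (my own example, not drawn from the data above):
from math import isclose

ratios = [0.1] * 10               # ten ratios that should sum to exactly 1.0
print(sum(ratios))                # 0.9999999999999999 on a typical IEEE-754 build
print(sum(ratios) == 1.0)         # False -- this valid combination would be dropped
print(isclose(sum(ratios), 1.0))  # True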

Until someone who understands the algorithm better than I do comes along, I don't think there's any way of speeding that up with the data types you have.
With the algorithm you are currently using:
Can you pre-sort your data and filter some branches out that way?
Is the ratio sum test more likely to fail than the duplicate test? If so, move it above the id check.
Drop the print (obviously).
Avoid the cast from tuple to list when appending.
And then use a multiprocessing.Pool() to use all your CPUs at once; since this is CPU-bound it will get you a reasonable speed-up (a rough sketch follows below).
But I'm sure there is a more efficient way of doing this. You haven't said how you're getting your data, but if you can represent it in an array it might be vectorisable, which will be orders of magnitude faster.
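For the multiprocessing.Pool() suggestion, here is a minimal sketch of one way it could look (the chunk size is arbitrary, the tolerance check stands in for the exact == 1.0 test, and _array_list and unique_list_count are the variables from the question):
from itertools import combinations, islice
from multiprocessing import Pool

def valid_combos(chunk):
    # keep combos whose ids are unique and whose ratios sum to (almost exactly) 1.0
    kept = []
    for combo in chunk:
        ids = [c['id'] for c in combo]
        if len(ids) == len(set(ids)) and abs(sum(c['ratio'] for c in combo) - 1.0) < 1e-9:
            kept.append(list(combo))
    return kept

def chunked(iterable, size):
    # yield successive lists of `size` items from any iterable
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

if __name__ == '__main__':
    all_combos = []
    with Pool() as pool:
        for i in range(1, unique_list_count + 1):
            chunks = chunked(combinations(_array_list, i), 50_000)
            for kept in pool.imap_unordered(valid_combos, chunks):
                all_combos.extend(kept)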

I assume the general case, where not every id has all the values in [0.01, 1.0].
There are 3 main optimisations you can make, and they all aim to instantly drop branches that are guaranteed not to satisfy your conditions.
1. Split the ratios of each id into a dictionary
This way you instantly avoid pointless combinations, e.g., [{'id': 1, 'ratio': 0.01}, {'id': 1, 'ratio': 0.02}]. It also makes it easier to try combinations between ids. So, instead of having everything in a flat list of dicts, reorganise the data in the following form:
# if your ids are 0-based consecutive integer numbers, a list of lists works too
array_list = {
    1: [0.01, 0.02, ...],
    2: [0.01, 0.02, ...],
    3: [...],
}
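Building that structure from the original flat list is a quick preprocessing step; a minimal sketch (ratios_by_id is my own name for it, assuming the _array_list layout from the question):
from collections import defaultdict

ratios_by_id = defaultdict(list)
for d in _array_list:
    ratios_by_id[d['id']].append(d['ratio'])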
2. For an N-size pairing, you have N-1 degrees of freedom
If you're searching for a triplet and you already have (0.54, 0.33, _), you don't have to search all possible values for the last id. There is only one that can satisfy the condition sum(ratios) == 1.0.
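For pairs, for instance, this collapses the inner search to a single membership test (a sketch using the ratios_by_id dict from the sketch above; the integer trick described further down removes the need for round()):
pairs = []
id2_ratios = set(ratios_by_id[2])
for r in ratios_by_id[1]:
    partner = round(1.0 - r, 2)      # the only ratio that can complete the pair
    if partner in id2_ratios:
        pairs.append((r, partner))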
3. You can further restrict the possible value range of each id based on the min/max values of the others.
Say you have 3 ids and they all have all the values in [0.01, 0.44]. It is pointless to try any combinations for (0.99, _, _), because the minimum sum for the last two ids is 0.02. Therefore, the maximum value that the first id can explore is 0.98 (well, 0.44 in this example but you get my drift). Similarly, if the maximum sum of the last two ids is 0.88, there is no reason to explore values below 0.12 for the first id. A special case of this is where the sum of the minimum value of all ids is more than 1.0 (or the max < 1.0), in which case you can instantly drop this combination and move on.
Using integers instead of floats
You are blessed in dealing only with some discrete values, so you're better off converting everything to integers. The first reason is to avoid any headaches with floating arithmetic. Case in point, did you know that your code misses some combinations exactly due to these inaccuracies?
And since you will be generating your own value ranges due to optimisation #3, it's much simpler to do for i in range(12, 99) than some roundabout way to generate all values in [0.12, .99) while making sure everything is properly rounded off at the second decimal digit AND THEN properly added together and checked against some tolerance value close to 1.0.
Code
from collections import defaultdict
import itertools as it

def combined_sum(values):
    def _combined_sum(values, comb, partial_sum, depth, mins, maxs):
        if depth == len(values) - 1:
            required = 100 - partial_sum
            if required in values[-1]:
                yield comb + (required,)
        else:
            min_value = mins[depth+1]
            max_value = maxs[depth+1]
            start_value = max(min(values[depth]), 100 - partial_sum - max_value)
            end_value = max(1, 100 - partial_sum - min_value)
            for value in range(start_value, end_value+1):
                if value in values[depth]:
                    yield from _combined_sum(values, comb+(value,), partial_sum+value, depth+1, mins, maxs)

    # precompute all the partial min/max sums, because we will be using them a lot
    mins = [sum(min(v) for v in values[i:]) for i in range(len(values))]
    maxs = [sum(max(v) for v in values[i:]) for i in range(len(values))]
    if mins[0] > 100 or maxs[0] < 100:
        return []
    return _combined_sum(values, tuple(), 0, 0, mins, maxs)

def asset_allocation(array_list, max_size):
    # a set is preferred here because we will be checking a lot whether
    # a value is in the iterable, which is faster in a set than in a tuple/list
    collection = defaultdict(set)
    for d in array_list:
        collection[d['id']].add(int(round(d['ratio'] * 100)))
    all_combos = []
    for i in range(1, max_size+1):
        for ids in it.combinations(collection.keys(), i):
            values = [collection[ID] for ID in ids]
            for group in combined_sum(values):
                all_combos.append([{'id': ID, 'ratio': ratio/100} for ID, ratio in zip(ids, group)])
    return all_combos

array_list = [{'id': ID, 'ratio': ratio/100}
              for ID in (1, 2, 3, 4, 5)
              for ratio in range(1, 101)]
max_size = 5
result = asset_allocation(array_list, max_size)
This finishes in 14-15 seconds on my machine.
For comparison, for 3 ids this finishes in 0.007 seconds and Gabor's solution which effectively implements only optimisation #1 finishes in 0.18 seconds. For 4 ids it's .43 s and 18.45 s respectively. For 5 ids I stopped timing his solution after a few minutes, but it was expected to take at least 10 minutes.
If you are dealing with the case where all ids have all the values in [0.01, 1.0] and you insist on having the specific output indicated in your question, the above approach is still optimal. However, if you are okay with generating the output in a different format, you can do better.
For a specific group size, e.g., singles, pairs, triplets, etc, generate all the partitions that add up to 100 using the stars and bars approach. That way, instead of generating (1, 99), (2, 98), etc, for each pair of ids, i.e., (1, 2), (1, 3) and (2, 3), you do this only once.
I've modified the code from here to not allow for 0 in any partition.
import itertools as it

def partitions(n, k):
    for c in it.combinations(range(1, n), k-1):
        yield tuple(b-a for a, b in zip((0,)+c, c+(n,)))

def asset_allocation(ids, max_size):
    all_combos = []
    for k in range(1, max_size+1):
        id_comb = tuple(it.combinations(ids, k))
        p = tuple(partitions(100, k))
        all_combos.append((id_comb, p))
    return all_combos

ids = (1, 2, 3, 4, 5)
result = asset_allocation(ids, 5)
This finishes much faster, takes up less space, and also allows you to home in on all the combinations for singles, pairs, etc., individually. Now, if you were to take the product of id_comb and p to generate the output in your question, you'd lose all that time saved. In fact, it'd come out a bit slower than the general method above, but at least this piece of code is still more compact.
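If you later do need the explicit list-of-dicts output for one group size, a small sketch of expanding a single (id_comb, p) entry from the result above (expand is my own helper name):
def expand(id_comb, p):
    # rebuild the [{'id': ..., 'ratio': ...}, ...] format of the question
    for ids in id_comb:
        for partition in p:
            yield [{'id': ID, 'ratio': ratio / 100} for ID, ratio in zip(ids, partition)]

id_comb, p = result[1]                 # entry for pairs (k == 2)
explicit_pairs = list(expand(id_comb, p))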

Related

Giving a composite score for merged dictionary keys

I am working on a project that needs to determine which ID is the most likely.
Let me explain using an example.
I have 3 dictionaries that contain IDs and their scores
Ex: d1 = {74701: 3, 90883: 2}
I assign percentage scores like this,
d1_p = {74701: 60.0, 90883: 40.0}, where the score is (value of key in d1)/(total sum of values)
Similarly I have 2 other dictionaries
d2 = {90883: 2, 74701: 2} , d2_p = {90883.0: 50.0, 74701.0: 50.0}
d3 = {75853: 2}, d3_p = {75853: 100.0}
My task is to give a composite score for each ID from the above 3 dictionaries and decide a winner by taking the highest score. How would I mathematically assign a composite score between 0-100 to each of these IDs?
Ex: in the above case 74701 needs to be the clear winner.
I tried taking the average, but it fails, because I need to give more preference to the IDs that occur in multiple dictionaries.
Ex: let's say 74701 had the majority in d1 and d2 with values 30 and 40. Then its average will be (30+40+0)/3 = 23.33, while 75853, which occurs only once with 100%, will get (100+0+0)/3 = 33.33 and be declared the winner, which is wrong.
Hence, can someone suggest a good mathematical way in Python, with maybe code, to give such a score and decide the majority?
Instead of trying to create a global score from different dictionaries, since your main goal is to analyze frequency, I would suggest summarizing all the data into a single dictionary, which is less error-prone in general. Say I have 3 dictionaries:
a = {1: 2, 2: 3}
b = {2: 4, 3: 5}
c = {3: 4, 4: 9}
You could summarize these three dictionaries into one by summing the values for each key:
result = {1: 2, 2: 7, 3: 9, 4: 9}
That could be easily achieved by using Counter:
from collections import Counter
result = Counter(a)
result.update(Counter(b))
result.update(Counter(c))
result = dict(result)
Which would yield the desired summary. If you want different weights for each dictionary, that can also be done in a similar fashion; the takeaway is that you should not be trying to obtain information from the dictionaries as separate entities, but instead merge them together into one statistic. A weighted merge could look like the sketch below.
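A minimal sketch of such a weighted merge (the weights here are made up purely for illustration):
from collections import Counter

weights = [1.0, 2.0, 0.5]        # hypothetical per-dictionary weights
dicts = [a, b, c]                # the dictionaries from the example above

weighted = Counter()
for w, d in zip(weights, dicts):
    for key, value in d.items():
        weighted[key] += w * value

result = dict(weighted)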
Think of the data in a tabular way: for each game/match/whatever, each ID gets
a certain number of points. If you care the most about overall point total for
the entire sequences of games (the entire "season", so to speak), then add up
the points to determine a winner (and then scale everything down/up to 0 to
100).
        74701   90883   75853
      -------------------------
1         3       2       0
2         2       2       0
3         0       0       2
Total     5       4       2
Alternatively, we can express those same scores in percentage terms per game.
Again, every ID must be given a value. In this case, we need to average the
percentages -- all of them, including the zeros:
        74701   90883   75853
      -------------------------
1        .6      .4      0
2        .5      .5      0
3        0       0       1.0
Avg     .37     .30     .33
Both approaches could make sense, depending on the context. And both also
declare 74701 to be the winner, as desired. But notice that they give different
results for 2nd and 3rd place. Such differences occur because the two systems
prioritize different things. You need to decide which approach you prefer.
Either way, the first step is to organize the data better. It seems more
convenient to have all scores or percentages for each ID, so you can do the
needed math with them: that sounds like a dict mapping IDs to lists of scores
or percentages.
# Put the data into one collection.
d1 = {74701: 3, 90883: 2}
d2 = {90883: 2, 74701: 2}
d3 = {75853: 2}
raw_scores = [d1, d2, d3]

# Find all IDs.
ids = tuple(set(i for d in raw_scores for i in d))

# Total points/scores for each ID.
points = {
    i: [d.get(i, 0) for d in raw_scores]
    for i in ids
}

# If needed, use that dict to create a similar dict for percentages. Or you
# could create a dict with the same structure holding *both* point totals and
# percentages. Just depends on the approach you pick.
pcts = {}
for i, scores in points.items():
    tot = sum(scores)
    pcts[i] = [sc / tot for sc in scores]
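To make the first (total points) approach concrete, a short sketch building on the points dict above (my own addition):
totals = {i: sum(scores) for i, scores in points.items()}   # overall points per ID
winner = max(totals, key=totals.get)
# totals -> {74701: 5, 90883: 4, 75853: 2} (key order may vary); winner -> 74701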

Is there a faster way to find co-occurrence of two elements in list of lists

I have a list like this:
a = [
    ['a1', 'b2', 'c3'],
    ['c3', 'd4', 'a1'],
    ['b2', 'a1', 'e5'],
    ['d4', 'a1', 'b2'],
    ['c3', 'b2', 'a1']
]
I'll be given x (e.g. 'a1'). I have to find the co-occurrence of a1 with every other element, sort it, and retrieve the top n (e.g. top 2).
My answer should be:
[
    {'product_id': 'b2', 'count': 4},
    {'product_id': 'c3', 'count': 3},
]
My current code looks like this:
import itertools

def compute(x):
    set_a = list(set(list(itertools.chain(*a))))
    count_dict = []
    for i in range(0, len(set_a)):
        count = 0
        for j in range(0, len(a)):
            if x == set_a[i]:
                continue
            if x in a[j] and set_a[i] in a[j]:  # both x and the candidate appear in the same row
                count += 1
        if count > 0:
            count_dict.append({'product_id': set_a[i], 'count': count})
    count_dict = sorted(count_dict, key=lambda k: k['count'], reverse=True)[:2]
    return count_dict
And it works beautifully for smaller inputs. However, my actual input has 70,000 unique items instead of 5 (a to e) and 1.3 million rows instead of 5, and hence the m×n scan becomes very expensive. Is there a faster way to do this?
"Faster" is a very general term. Do you need a shorter total processing time, or shorter response time for a request? Is this for only one request, or do you want a system that handles repeated inputs?
If what you need is the fastest response time for repeated inputs, then convert this entire list of lists into a graph, with each element as a node, and the edge weight being the number of occurrences between the two elements. You make a single pass over the data to build the graph. For each node, sort the edge list by weight. From there, each request is a simple lookup: return the weight of the node's top edge, which is a hash (linear function) and two direct access operations (base address + offset).
UPDATE after OP's response
"fastest response" seals the algorithm, then. What you want to have is a simple dict, keyed by each node. The value of each node is a sorted list of related elements and their counts.
A graph package (say, networkx) will give you a good entry to this, but may not retain a node's edges in fast form, nor sorted by weight. Instead, pre-process your data set. For each row, you have a list of related elements. Let's just look at the processing for some row in the midst of the data set; call the elements a5, b2, z1, and the dict d. Assume that a5, b2 is already in your dict.
Using itertools, iterate through the six pairs:
(a5, b2):
d[a5][b2] += 1
(a5, z1):
d[a5][z1] = 1 (creates a new entry under a5)
(b2, a5):
d[b2][a5] += 1
(b2, z1):
d[b2][z1] = 1 (creates a new entry under b2)
(z1, a5):
d[z1] = {} (creates a new z1 entry in d)
d[z1][a5] = 1 (creates a new entry under z1)
(z1, b2):
d[z1][b2] = 1 (creates a new entry under z1)
You'll want to use defaultdict to save you some hassle to detect and initialize new entries.
With all of that handled, you now want to sort each of those sub-dicts into order based on the sub-level values. This leaves you with an ordered sequence for each element. When you need to access the top n connected elements, you go straight to the dict and extract them:
top = d[elem][:n]
Can you finish the coding from there?
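A possible completion of that preprocessing (my own sketch of the approach described above, using collections.Counter so each element's neighbour counts come back already sorted via most_common()):
import itertools
from collections import defaultdict, Counter

# a = [['a1', 'b2', 'c3'], ...]  -- the list of lists from the question
d = defaultdict(Counter)
for row in a:
    for x, y in itertools.permutations(row, 2):
        d[x][y] += 1                 # count co-occurrences of x with y

def top_n(elem, n):
    return [{'product_id': k, 'count': c} for k, c in d[elem].most_common(n)]

print(top_n('a1', 2))
# [{'product_id': 'b2', 'count': 4}, {'product_id': 'c3', 'count': 3}]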
As mentioned by #Prune, it is not clear whether you want a shorter processing time or a shorter response time.
So I will explain two approaches to this problem.
The optimised code approach (for less processing time)
import itertools
from heapq import nlargest
from operator import itemgetter

# say we have K THREADS
def compute(x, top_n=2):
    # first you find the unique items and save them somewhere easily accessible
    set_a = list(set(list(itertools.chain(*a))))
    # first find in which of your ROWS x exists
    selected_rows = []
    for i, row in enumerate(a):  # this whole loop can be parallelized
        if x in row:
            selected_rows.append(i)  # append the index of the row to selected_rows
    # time complexity till now is still O(M*N), but this part can be run in parallel as well,
    # as each of the M rows can be evaluated independently
    # if we have K threads, it is going to take us (M/K)*N time to run it
    count_dict = []
    # now the same thing you did earlier, but in the second loop we look at fewer rows
    for val in set_a:
        if val == x:
            continue
        count = 0
        for ri in selected_rows:  # this whole part can be parallelized as well
            if val in a[ri]:
                count += 1
        count_dict.append({'product_id': val, 'count': count})
    # if our selected rows size in the worst case is M itself
    # and our unique values are U, the complexity
    # will be (U/K)*(M/K)*N
    res = nlargest(top_n, count_dict, key=itemgetter('count'))
    return res
Let's calculate the time complexity here.
If we have K threads, then it is
O((M/K)*N) + O((U/K)*(M/K)*N)
where
M---> Total rows
N---> Total Columns
U---> Unique Values
K---> number of threads
Graph approach as suggested by Prune
# other approach
#adding to Prune approach
big_dictionary={}
set_a = list(set(list(itertools.chain(*a))))
for x in set_a:
big_dictionary[x]=[]
for y in set_a:
count=0
if x==y:
continue
for arr in a:
if (x in arr) and (y in arr):
count+=1
big_dictionary[x].append((y,count))
for x in big_dictionary:
big_dictionary[x]=sorted(big_dictionary[x], key=lambda v:v[1], reverse=True)
Let's calculate the time complexity for this one.
The one-time cost of building it will be:
O(U*U*M*N)
where
M---> Total rows
N---> Total Columns
U---> Unique Values
But once this big_dictionary is calculated,
it takes just one step to get your top-N values.
For example if we want to get top3 values for a1
result=big_dictionary['a1'][:3]
I followed the defaultdict approach as suggested by #Prune above. Here's the final code:
import numpy as np
from collections import defaultdict

def recommender(input_item, b_list, n):
    count = []
    top_items = []
    for x in b.keys():
        lst_2 = b[x]
        common_transactions = len(set(b_list) & set(lst_2))
        count.append(common_transactions)
    top_ids = list((np.argsort(count)[:-n-2:-1])[1::])
    top_values_counts = [count[i] for i in top_ids]
    key_list = list(b.keys())
    for i, v in enumerate(top_ids):
        item_id = key_list[v]
        top_items.append({item_id: top_values_counts[i]})
    print(top_items)
    return top_items

a = [
    ['a1', 'b2', 'c3'],
    ['c3', 'd4', 'a1'],
    ['b2', 'a1', 'e5'],
    ['d4', 'a1', 'b2'],
    ['c3', 'b2', 'a1']
]

b = defaultdict(list)
for i, s in enumerate(a):
    for key in s:
        b[key].append(i)

input_item = str(input("Enter the item_id: "))
n = int(input("How many values to be retrieved? (eg: top 5, top 2, etc.): "))
top_items = recommender(input_item, b[input_item], n)
Here's the output for top 3 for 'a1':
[{'b2': 4}, {'c3': 3}, {'d4': 2}]
Thanks!!!

Sorting a list to maximize similarity between neighboring items

I have a list of keys:
['A', 'B', 'C']
For each of those keys there's a list of properties:
{
'A': [2,3],
'B': [1,2],
'C': [4]
}
I wish to sort the list of labels such that neighboring labels share as many of the properties as possible.
In the above example A and B share the relation 2, so they should be next to each other - whereas C shares nothing with them, so it can go anywhere.
So the possible orders for this example would be as follows:
["A","B","C"] # acceptable
["A","C","B"] # NOT acceptable
["B","A","C"] # acceptable
["B","C","A"] # NOT acceptable
["C","A","B"] # acceptable
["C","B","A"] # acceptable
Buckets
Actually I would prefer this to be represented by putting them into "buckets":
[["A", "B"], ["C"]] # this can represent all four possible orders above.
However, this gets problematic if a label belongs to two different buckets:
{
'A': [2,3],
'B': [1,2],
'C': [1,4]
}
How would I represent that?
I could put it like this:
[["A", "B"], ["C", "B"]]
But then I need another processing step to turn the list of buckets
into the final representation:
["A", "B", "C"]
And above that there could be recursively nested buckets:
[[["A","B"], ["C"]], ["D"]]
And then these could overlap:
[[["A","B"], ["C"]], ["A","D"]]
Quality
The "closeness", i.e. quality of a solution is defined as the sum of the intersection of relations between neighbors (the higher the quality the better):
def measurequality(result, mapping):
    lastKey = None
    quality = 0
    for key in result:
        if lastKey is None:
            lastKey = key
            continue
        quality += len(set(mapping[key]).intersection(mapping[lastKey]))
        lastKey = key
    return quality
# Example determining that the solution ['A', 'B', 'C'] has quality 1:
#measurequality(['A', 'B', 'C'],
# {
# 'A': [2,3],
# 'B': [1,2],
# 'C': [4]
# })
Brute-Forcing
Brute-forcing does not constitute a solution (in practice the list contains on the order of several thousand elements - though, if anyone has a brute-forcing approach that is better than O(n²)...).
However, using brute-forcing to create additional test cases is possible:
produce a list L of n items ['A','B','C',...]
produce for each item a dictionary R of relations (up to n random numbers between 0 and n should be sufficient).
produce all possible permutations of L and feed them together with R into measurequality() and keep those with maximal return value (might not be unique).
Code for creating random testcases to test the implementation:
import string
import random
def randomtestcase(n):
keys=list(string.ascii_uppercase[0:n])
minq = 0
maxq = 0
while minq == maxq:
items={}
for key in keys:
items[key] = random.sample(range(1,10),int(random.random()*10))
minq = n*n
minl = list(keys)
maxq = 0
maxl = list(keys)
for _ in range(0, 1000): # TODO: explicitly construct all possible permutations of keys.
random.shuffle(keys)
q = measurequality(keys,items)
if q < minq:
minq = q
minl = list(keys)
if maxq < q:
maxq = q
maxl = list(keys)
return ( items, minl, maxq )
( items, keys, quality ) = randomtestcase(5)
sortedkeys = dosomething( keys, items )
actualquality = measurequality( sortedkeys, items )
if actualquality < quality:
print('Suboptimal: quality {0} < {1}'.format(actualquality,quality))
Attempt
One of the many "solutions" that didn't work (very broken, this one doesn't have the selection of initial element / choice between prepending and appending to the result list that I had in others):
def dosomething(keys,items):
result = []
todo = list(keys)
result.append(todo.pop())
while any(todo):
lastItems = set(items[result[-1]])
bestScore = None
bestKey = None
for key in todo:
score = set(items[key]).intersection(lastItems)
if bestScore is None or bestScore < score:
bestScore = score
bestKey = key
todo.remove(bestKey)
result.append(bestKey)
return result
Examples
(Also check out the example generator in the section Brute-Forcing above.)
Testing code trying some examples:
def test(description, acceptable, keys, arguments):
    actual = dosomething(keys, arguments)
    if "".join(actual) in acceptable:
        return 0
    print("\n[{0}] {1}".format("".join(keys), description))
    print("Expected: {0}\nBut was: {1}".format(acceptable, actual))
    print("Quality of result: {0}".format(measurequality(actual, arguments)))
    print("Quality of expected: {0}".format([measurequality(a, arguments) for a in acceptable]))
    return 1

print("EXAMPLES")
failures = 0

# Need to try each possible ordering of letters to ensure that the order of keys
# wasn't accidentally already a valid ordering.
for keys in [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["B", "A", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
    ["C", "B", "A"]
]:
    failures += test(
        "1. A and B both have 2, C doesn't, so C can go first or last but not in between.",
        ["ABC", "BAC", "CAB", "CBA"],
        keys,
        {
            "A": [2, 3],
            "B": [1, 2],
            "C": [4]
        })
    failures += test(
        "2. They all have 2, so they can show up in any order.",
        ["ABC", "ACB", "BAC", "BCA", "CAB", "CBA"],
        keys,
        {
            "A": [2, 3],
            "B": [1, 2],
            "C": [2]
        })
    failures += test(
        "3. A and B share 2, B and C share 1, so B must be in the middle.",
        ["ABC", "CBA"],
        keys,
        {
            "A": [2, 3],
            "B": [1, 2],
            "C": [1]
        })
    failures += test(
        "4. Each shares something with each other, creating a cycle, so they can show up in any order.",
        ["ABC", "ACB", "BAC", "BCA", "CAB", "CBA"],
        keys,
        {
            "A": [2, 3],
            "B": [1, 2],
            "C": [1, 3]
        })

if 0 < failures:
    print("{0} FAILURES".format(failures))
Precedence
As it was asked: the numbers used for the relations aren't in an order of precedence. An order of precedence exists, but it's a partial order and not the one of the numbers. I just didn't mention it because it makes the problem harder.
So given this example:
{
'A': [2,3],
'B': [1,2],
'C': [4]
}
Might be replaced by the following (using letters instead of numbers and adding precedence information):
{
'A': [('Q',7),('R',5)],
'B': [('P',6),('Q',6)],
'C': [('S',5)]
}
Note that
The precedence is meaningful only within a list, not across lists.
The precedence of shared relations might be different between two lists.
Within a list there might be the same precedence several times.
This is a Travelling Salesman Problem, a notoriously hard problem to solve optimally. The code presented solves for 10,000 nodes with simple interconnections (i.e. one or two relations each) in about 15 minutes. It performs less well for sequences that are more richly interconnected. This is explored in the test results below.
The idea of precedence, mentioned by the OP, is not explored.
The presented code consists of a heuristic solution, a brute-force solution that is optimal but not practical for large node_sets, and some simple but scalable test data generators, some with known optimal solutions. The heuristic finds optimal solutions for the OP's 'ABC' example, my own 8-item node_set, and for scalable test data for which optimal solutions are known.
If the performance is not good enough, at least it is a bit of a first effort and has the beginnings of a test 'workshop' to improve the heuristic.
Test Results
>>> python sortbylinks.py
===============================
"ABC" (sequence length 3)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 3
[('A', 'B', 'C'), ('C', 'A', 'B'), ('B', 'A', 'C')], ...and more
Top score: 1
"optimise_order" function took 0.0s
Optimal quality: 1
===============================
"Nick 8-digit" (sequence length 8)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('A', 'E', 'F', 'C', 'D', 'G', 'B', 'H')], ...and more
Top score: 6
"optimise_order" function took 0.0s
Optimal quality: 6
Short, relatively trivial cases appear to be no problem.
===============================
"Quality <1000, cycling subsequence, small number of relations" (sequence length 501)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 3
[('AAAC', 'AAAL', 'AAAU', ...), ...], ...and more
Top score: 991
"optimise_order" function took 2.0s
Optimal quality: 991
===============================
"Quality <1000, cycling subsequence, small number of relations, shuffled" (sequence length 501)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 3
[('AADF', 'AAKM', 'AAOZ', ...), ...], ...and more
Top score: 991
"optimise_order" function took 2.0s
Optimal quality: 991
The "Quality <1000, cycling subsequence" (sequence length 501) is interesting. By grouping nodes with {0, 1} relation sets the quality score can be nearly doubled. The heuristic finds this optimal sequence. Quality 1000 is not quite possible because these double-linked groups need attaching to each other via a single-linked node every so often (e.g. {'AA': {0, 1}, 'AB': {0, 1}, ..., 'AZ': {0, 1}, <...single link here...> 'BA': {1, 2}, 'BB': {1, 2}, ...}).
Performance is still good for this test data with few relations per node.
"Quality 400, single unbroken chain, initial solution is optimal" (sequence length 401)
===============================
Working...
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAAA', 'AAAB', 'AAAC', ...), ...], ...and more
Top score: 400
"optimise_order" function took 0.0s
Optimal quality: 400
===============================
"Quality 400, single unbroken chain, shuffled" (sequence length 401)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAAA', 'AAAB', 'AAAC', ...), ...], ...and more
Top score: 400
"optimise_order" function took 0.0s
Optimal quality: 400
One of the difficulties with Travelling Salesman Problems (TSPs) is knowing when you have an optimal solution. The heuristic doesn't seem to converge any faster even from a near-optimal or optimal start.
===============================
"10,000 items, single unbroken chain, initial order is optimal" (sequence length 10001)
===============================
Working...
Finding heuristic candidates...
Number of candidates with top score: 1
[('AOUQ', 'AAAB', 'AAAC', ...), ...], ...and more
Top score: 10002
"optimise_order" function took 947.0s
Optimal quality: 10000
When there are very small numbers of relations, even if there are many nodes, performance is pretty good and the results may be close to optimal.
===============================
"Many random relations per node (0..n), n=200" (sequence length 200)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAEO', 'AAHC', 'AAHQ', ...), ...], ...and more
Top score: 6861
"optimise_order" function took 94.0s
Optimal quality: ?
===============================
"Many random relations per node (0..n), n=500" (sequence length 500)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAJT', 'AAHU', 'AABT', ...), ...], ...and more
Top score: 41999
"optimise_order" function took 4202.0s
Optimal quality: ?
This is more like the data generated by the OP, and also more like the classical Travelling Salesman Problem (TSP) where you have a set of distances between each city pair (for 'city' read 'node') and nodes are typically richly connected to each other. In this case the links between nodes is partial-- there is no guarantee of a link between any 2 nodes.
The heuristic's time performance is much worse in cases like this. There are between 0 and n random relations for each node, for n nodes. This is likely to mean many more swap combinations yield improved quality, swaps and quality checks are more expensive in themselves and many more passes will be needed before the heuristic converges on its best result. This may mean O(n^3) in the worst case.
Performance degrades as the number of nodes and relations increases (note the difference between n=200-- 3 minutes-- and n=500-- 70 minutes.) So currently the heuristic may not be practical for several thousand richly-interconnected nodes.
In addition, the quality of the result for this test can't be known precisely because a brute-force solution is not computationally feasible. 6861 / 200 = 34.3 and 41999 / 500 = 84.0 average connections between node pairs doesn't look too far off.
Code for the Heuristic and Brute Force Solvers
import sys
from collections import deque
import itertools
import random
import datetime

# TODO: in-place swapping? (avoid creating copies of sequences)

def timing(f):
    """
    decorator for displaying execution time for a method
    """
    def wrap(*args):
        start = datetime.datetime.now()
        f_return_value = f(*args)
        end = datetime.datetime.now()
        print('"{:s}" function took {:.1f}s'.format(f.__name__, (end-start).seconds))
        return f_return_value
    return wrap

def flatten(a):
    # e.g. [a, [b, c], d] -> [a, b, c, d]
    return itertools.chain.from_iterable(a)
class LinkAnalysis:
    def __init__(self, node_set, max_ram=100_000_000, generate_seeds=True):
        """
        :param node_set: node_ids and their relation sets to be arranged in sequence
        :param max_ram: estimated maximum RAM to use
        :param generate_seeds: if true, attempt to generate some initial candidates based on sorting
        """
        self.node_set = node_set
        self.candidates = {}
        self.generate_seeds = generate_seeds
        self.seeds = {}
        self.relations = []
        # balance performance and RAM using regular 'weeding'
        candidate_size = sys.getsizeof(deque(self.node_set.keys()))
        self.weed_interval = max_ram // candidate_size

    def create_initial_candidates(self):
        print('Working...')
        self.generate_seed_from_presented_sequence()
        if self.generate_seeds:
            self.generate_seed_candidates()

    def generate_seed_from_presented_sequence(self):
        """
        add initially presented order of nodes as one of the seed candidates
        this is worthwhile because the initial order may be close to optimal
        """
        presented_sequence = self.presented_sequence()
        self.seeds[tuple(self.presented_sequence())] = self.quality(presented_sequence)

    def presented_sequence(self) -> list:
        return list(self.node_set.keys())  # relies on Python 3.6+ to preserve key order in dicts

    def generate_seed_candidates(self):
        initial_sequence = self.presented_sequence()
        # get relations from the original data structure
        relations = sorted(set(flatten(self.node_set.values())))
        # sort by lowest precedence relation first
        print('...choosing seed candidates')
        for relation in reversed(relations):
            # use true-false ordering: in Python, True > False
            initial_sequence.sort(key=lambda sortkey: not relation in self.node_set[sortkey])
            sq = self.quality(initial_sequence)
            self.seeds[tuple(initial_sequence)] = sq

    def quality(self, sequence):
        """
        calculate quality of full sequence
        :param sequence:
        :return: quality score (int)
        """
        pairs = zip(sequence[:-1], sequence[1:])
        scores = [len(self.node_set[a].intersection(self.node_set[b]))
                  for a, b in pairs]
        return sum(scores)

    def brute_force_candidates(self, sequence):
        for sequence in itertools.permutations(sequence):
            yield sequence, self.quality(sequence)

    def heuristic_candidates(self, seed_sequence):
        # look for solutions with higher quality scores by swapping elements
        # start with large distances between elements
        # then reduce by power of 2 until swapping next-door neighbours
        max_distance = len(seed_sequence) // 2
        max_pow2 = int(pow(max_distance, 1/2))
        distances = [int(pow(2, r)) for r in reversed(range(max_pow2 + 1))]
        for distance in distances:
            yield from self.seed_and_variations(seed_sequence, distance)
        # seed candidate may be better than its derived sequences -- include it as a candidate
        yield seed_sequence, self.quality(seed_sequence)

    def seed_and_variations(self, seed_sequence, distance=1):
        # swap elements at a distance, starting from beginning and end of the
        # sequence in seed_sequence
        candidate_count = 0
        for pos1 in range(len(seed_sequence) - distance):
            pos2 = pos1 + distance
            q = self.quality(seed_sequence)
            # from beginning of sequence
            yield self.swap_and_quality(seed_sequence, q, pos1, pos2)
            # from end of sequence
            yield self.swap_and_quality(seed_sequence, q, -pos1, -pos2)
            candidate_count += 2
            if candidate_count > self.weed_interval:
                self.weed()
                candidate_count = 0

    def swap_and_quality(self, sequence, preswap_sequence_q: int, pos1: int, pos2: int) -> (tuple, int):
        """
        swap and return quality (which can easily be calculated from the present quality)
        :param sequence: as for swap
        :param pos1: as for swap
        :param pos2: as for swap
        :param preswap_sequence_q: quality of pre-swapped sequence
        :return: swapped sequence, quality of swapped sequence
        """
        initial_node_q = sum(self.node_quality(sequence, pos) for pos in [pos1, pos2])
        swapped_sequence = self.swap(sequence, pos1, pos2)
        swapped_node_q = sum(self.node_quality(swapped_sequence, pos) for pos in [pos1, pos2])
        qdelta = swapped_node_q - initial_node_q
        swapped_sequence_q = preswap_sequence_q + qdelta
        return swapped_sequence, swapped_sequence_q

    def swap(self, sequence, node_pos1: int, node_pos2: int):
        """
        deques perform better than lists for swapping elements in a long sequence
        :param sequence: sequence on which to perform the element swap
        :param node_pos1: zero-based position of first element
        :param node_pos2: zero-based position of second element
        >>> swap(('A', 'B', 'C'), 0, 1)
        ('B', 'A', 'C')
        """
        if type(sequence) is tuple:
            # sequence is a candidate (which are dict keys and hence tuples)
            # needs converting to a list for swap processing
            sequence = list(sequence)
        if node_pos1 == node_pos2:
            return sequence
        tmp = sequence[node_pos1]
        sequence[node_pos1] = sequence[node_pos2]
        sequence[node_pos2] = tmp
        return sequence

    def node_quality(self, sequence, pos):
        if pos < 0:
            pos = len(sequence) + pos
        no_of_links = 0
        middle_node_relations = self.node_set[sequence[pos]]
        if pos > 0:
            left_node_relations = self.node_set[sequence[pos - 1]]
            no_of_links += len(left_node_relations.intersection(middle_node_relations))
        if pos < len(sequence) - 1:
            right_node_relations = self.node_set[sequence[pos + 1]]
            no_of_links += len(middle_node_relations.intersection(right_node_relations))
        return no_of_links

    @timing
    def optimise_order(self, selection_strategy):
        top_score = 0
        new_top_score = True
        self.candidates.update(self.seeds)
        while new_top_score:
            top_score = max(self.candidates.values())
            new_top_score = False
            initial_candidates = {name for name, score in self.candidates.items() if score == top_score}
            for initial_candidate in initial_candidates:
                for candidate, q in selection_strategy(initial_candidate):
                    if q > top_score:
                        new_top_score = True
                        top_score = q
                        self.candidates[tuple(candidate)] = q
            self.weed()
        print(f"Number of candidates with top score: {len(list(self.candidates))}")
        print(f"{list(self.candidates)[:3]}, ...and more")
        print(f"Top score: {top_score}")

    def weed(self):
        # retain only top-scoring candidates
        top_score = max(self.candidates.values())
        low_scorers = {k for k, v in self.candidates.items() if v < top_score}
        for low_scorer in low_scorers:
            del self.candidates[low_scorer]
Code Glossary
node_set: a set of labelled nodes of the form 'unique_node_id': {relation1, relation2, ..., relationN}. The set of relations for each node can contain either no relations or an arbitrary number.
node: a key-value pair consisting of a node_id (key) and set of relations (value)
relation: as used by the OP, this is a number. If two nodes both share relation 1 and they are neighbours in the sequence, it adds 1 to the quality of the sequence.
sequence: an ordered set of node ids (e.g. ['A', 'B', 'C']) that is associated with a quality score. The quality score is the sum of shared relations between nodes in the sequence. The output of the heuristic is the sequence or sequences with the highest quality score.
candidate: a sequence that is currently being investigated to see if it is of high quality.
Method
generate seed sequences by stable sorting on the presence or absence of each relation in a linked item
The initially presented order is also one of the seed sequences in case it is close to optimal
For each seed sequence, pairs of nodes are swapped looking for a higher quality score
Execute a "round" for each seed sequence. A round is a shellsort-like pass over the sequence, swapping pairs of nodes, at first at a distance, then narrowing the distance until there is a distance of 1 (swapping immediate neighbours.) Keep only those sequences whose quality is more than the current top quality score
If a new highest quality score was found in this round, weed out all but top-score candidates and repeat 4 using top scorers as seeds. Otherwise exit.
Tests and Test Data Generators
The heuristic has been tested using small node_sets, scaled data of a few hundred up to 10,000 nodes with very simple relationships, and a randomised, richly interconnected node_set more like the OP's test data generator. Perfect single-linked sequences, linked cycles (small subsequences that link within themselves, and to each other) and shuffling have been useful to pick up and fix weaknesses.
ABC_links = {
    'A': {2, 3},
    'B': {1, 2},
    'C': {4}
}

nick_links = {
    'B': {1, 2, 4},
    'C': {4},
    'A': {2, 3},
    'D': {4},
    'E': {3},
    'F': {5, 6},
    'G': {2, 4},
    'H': {1},
}

unbroken_chain_linked_tail_to_head = ({1, 3}, {3, 4}, {4, 5}, {5, 6}, {6, 7}, {7, 8}, {8, 9}, {9, 10}, {10, 1})
cycling_unbroken_chain_linked_tail_to_head = itertools.cycle(unbroken_chain_linked_tail_to_head)

def many_nodes_many_relations(node_count):
    # data set with n nodes and random 0..n relations as per OP's requirement
    relation_range = list(range(node_count))
    relation_set = (
        set(random.choices(relation_range, k=random.randint(0, node_count)))
        for _ in range(sys.maxsize)
    )
    return scaled_data(node_count, relation_set)

def scaled_data(no_of_items, link_sequence_model):
    uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    # unique labels using sequence of four letters (AAAA, AAAB, AAAC, .., AABA, AABB, ...)
    item_names = (''.join(letters) for letters in itertools.product(*([uppercase] * 4)))
    # only use a copy of the original link sequence model-- otherwise the model could be exhausted
    # or start mid-cycle
    # https://stackoverflow.com/questions/42132731/how-to-create-a-copy-of-python-iterator
    link_sequence_model, link_sequence = itertools.tee(link_sequence_model)
    return {item_name: links for _, item_name, links in zip(range(no_of_items), item_names, link_sequence)}

def shuffled(items_list):
    """relies on Python 3.6+ dictionary insertion-ordered keys"""
    shuffled_keys = list(items_list.keys())
    random.shuffle(shuffled_keys)
    return {k: items_list[k] for k in shuffled_keys}

cycling_quality_1000 = scaled_data(501, cycling_unbroken_chain_linked_tail_to_head)
cycling_quality_1000_shuffled = shuffled(cycling_quality_1000)

linked_forward_sequence = ({n, n + 1} for n in range(sys.maxsize))
# { 'A': {0, 1}, 'B': {1, 2}, ... } links A to B to ...
optimal_single_linked_unbroken_chain = scaled_data(401, linked_forward_sequence)
shuffled_single_linked_unbroken_chain = shuffled(optimal_single_linked_unbroken_chain)

large_node_set = scaled_data(10001, cycling_unbroken_chain_linked_tail_to_head)
large_node_set_shuffled = shuffled(large_node_set)

tests = [
    ('ABC', 1, ABC_links, True),
    ('Nick 8-digit', 6, nick_links, True),
    # ('Quality <1000, cycling subsequence, small number of relations', 1000 - len(unbroken_chain_linked_tail_to_head), cycling_quality_1000, True),
    # ('Quality <1000, cycling subsequence, small number of relations, shuffled', 1000 - len(unbroken_chain_linked_tail_to_head), cycling_quality_1000_shuffled, True),
    ('Many random relations per node (0..n), n=200', '?', many_nodes_many_relations(200), True),
    # ('Quality 400, single unbroken chain, initial solution is optimal', 400, optimal_single_linked_unbroken_chain, False),
    # ('Quality 400, single unbroken chain, shuffled', 400, shuffled_single_linked_unbroken_chain, True),
    # ('10,000 items, single unbroken chain, initial order is optimal', 10000, large_node_set, False),
    # ('10,000 items, single unbroken chain, shuffled', 10000, large_node_set_shuffled, True),
]

for title, expected_quality, item_links, generate_seeds in tests:
    la = LinkAnalysis(node_set=item_links, generate_seeds=generate_seeds)
    seq_length = len(list(la.node_set.keys()))
    print()
    print('===============================')
    print(f'"{title}" (sequence length {seq_length})')
    print('===============================')
    la.create_initial_candidates()
    print('Finding heuristic candidates...')
    la.optimise_order(la.heuristic_candidates)
    print(f'Optimal quality: {expected_quality}')
    # print('Brute Force working...')
    # la.optimise_order(la.brute_force_candidates)
Performance
The heuristic is more 'practical' than the brute force solution because it leaves out many possible combinations. It may be that a lower-quality sequence produced by an element swap is actually one step away from a much higher-quality score after one more swap, but such a case might be weeded out before it could be tested.
The heuristic appears to find optimal results for single-linked sequences or cyclical sequences linked head to tail. These have a known optimal solution and the heuristic finds that solution and they may be less complex and problematic than real data.
A big improvement came with the introduction of an "incremental" quality calculation which can quickly calculate the quality difference a two-element swap makes without recomputing the quality score for the entire sequence.
I was tinkering with your test program and came up with this solution, which gives me 0 failures. It feels like a heuristic though; it definitely needs more testing and test cases. The function assumes that the keys are unique (so no ['A', 'A', 'B', ...] lists) and that all elements are present in the arguments dictionary:
def dosomething(_, arguments):
    m = {}
    for k, v in arguments.items():
        for i in v:
            m.setdefault(i, []).append(k)
    out, seen = [], set()
    for _, values in sorted(m.items(), key=lambda k: -len(k[1])):
        for v in values:
            if v not in seen:
                out.append(v)
                seen.add(v)
    return out
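For reference, a quick check against the first example from the question (my own addition, not part of the original answer):
arguments = {'A': [2, 3], 'B': [1, 2], 'C': [4]}
print(dosomething(['A', 'B', 'C'], arguments))   # ['A', 'B', 'C'] -- one of the acceptable orders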
EDIT: Misread Quality function, this maps to separable traveling salesman problems
For N nodes, P properties, and T total properties across all nodes, this should be able to be solved in O(N + P + T) or better, depending on the topology of the data.
Lets convert your problem to a graph, where the "distance" between any two nodes is -(number of shared properties). Nodes with no connections would be left unlinked. This will take at least O(T) to create the graph, and perhaps another O(N + P) to segment.
Your "sort order" is then translated to a "path" through the nodes. In particular, you want the shortest path.
Additionally, you will be able to apply several translations to improve the performance and usability of generic algorithms:
Segment graph into disconnected chunks and solve each of them independently
Renormalize all the values to start at 1..N instead of -N..-1 (per graph, but doesn't really matter, could add |number of properties| instead).
https://en.wikipedia.org/wiki/Component_(graph_theory)#Algorithms
It is straightforward to compute the components of a graph in linear time (in terms of the numbers of the vertices and edges of the graph).
https://en.wikipedia.org/wiki/Shortest_path_problem#Undirected_graphs
Weights   Time complexity     Author
ℝ+        O(V²)               Dijkstra 1959
ℝ+        O((E + V) log V)    Johnson 1977 (binary heap)
ℝ+        O(E + V log V)      Fredman & Tarjan 1984 (Fibonacci heap)
ℕ         O(E)                Thorup 1999 (requires constant-time multiplication).
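A minimal sketch of the graph translation and segmentation step (using networkx, which is my own choice here rather than something prescribed above):
import networkx as nx

properties = {'A': [2, 3], 'B': [1, 2], 'C': [4]}   # the example from the question

# build a graph whose edge weights count shared properties
G = nx.Graph()
G.add_nodes_from(properties)
keys = list(properties)
for i, u in enumerate(keys):
    for v in keys[i + 1:]:
        shared = len(set(properties[u]) & set(properties[v]))
        if shared:
            G.add_edge(u, v, weight=shared)

# disconnected chunks can be ordered independently of each other
components = [sorted(c) for c in nx.connected_components(G)]
print(components)   # e.g. [['A', 'B'], ['C']]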

python float precision: will this increment work reliably?

I use the following code to dynamically generate a list of dictionaries of every combination of incremental probabilities associated with a given list of items, such that the probabilities sum to 1. For example, if the increment_divisor were 2 (leading to increment of 1/2 or 0.5), and the list contained 3 items ['a', 'b', 'c'], then the function should return
[{'a': 0.5, 'b': 0.5, 'c': 0.0},
{'a': 0.5, 'b': 0.0, 'c': 0.5},
{'a': 0.0, 'b': 0.5, 'c': 0.5},
{'a': 1.0, 'b': 0.0, 'c': 0.0},
{'a': 0.0, 'b': 1.0, 'c': 0.0},
{'a': 0.0, 'b': 0.0, 'c': 1.0}]
The code is as follows. The script generates the incrementer by calculating 1/x and then iteratively adds the incrementer to increments until the value is >= 1.0. I already know that python floats are imprecise, but I want to be sure that the last value in increments will be something very close to 1.0.
import sys
from collections import OrderedDict
from itertools import permutations

def generate_hyp_space(list_of_items, increment_divisor):
    """Generate list of OrderedDicts filling the hypothesis space.

    Each OrderedDict is of the form ...
        { i1: 0.0, i2: 0.1, i3: 0.0, ...}
    ... where .values() sums to 1.

    Arguments:
    list_of_items -- items that receive prior weights
    increment_divisor -- Increment by 1/increment_divisor. For example,
        4 yields (0.0, 0.25, 0.5, 0.75, 1.0).
    """
    _LEN = len(list_of_items)
    if increment_divisor < _LEN:  # permutations() returns None if this is True
        print('WARN: increment_divisor too small, so was reset to '
              'len(list_of_items).', file=sys.stderr)
        increment_divisor = _LEN
    increment_size = 1/increment_divisor
    h_space = []
    increments = []
    incremental = 0.0
    while incremental <= 1.0:
        increments.append(incremental)
        incremental += increment_size
    for p in permutations(increments, _LEN):
        if sum(p) == 1.0:
            h_space.append(OrderedDict([(list_of_items[i], p[i])
                                        for i in range(_LEN)]))
    return h_space
How large can the increment_divisor be before the imprecision of float breaks the reliability of the script? (specifically, while incremental <= 1.0 and if sum(p) == 1.0)
This is a small example, but real use will involve much larger permutation space. Is there a more efficient/effective way to achieve this goal? (I already plan to implement a cache.) Would numpy datatypes be useful here for speed or precision?
The script generates the incrementer by calculating 1/x and then iteratively adds the incrementer to increments until the value is >= 1.0.
No, no, no. Just make a list of [0/x, 1/x, ..., (x-1)/x, x/x] by dividing each integer from 0 to x by x:
increments = [i/increment_divisor for i in range(increment_divisor+1)]
# or for Python 2
increments = [1.0*i/increment_divisor for i in xrange(increment_divisor+1)]
The list will always have exactly the right number of elements, no matter what rounding errors occur.
With NumPy, this would be numpy.linspace:
increments = numpy.linspace(start=0, stop=1, num=increment_divisor+1)
As for your overall problem, working in floats at all is probably a bad idea. You should be able to do the whole thing with integers and only divide by increment_divisor at the end, so you don't have to deal with floating-point precision issues in sum(p) == 1.0. Also, itertools.permutations doesn't do what you want, since it doesn't allow repeated items in the same permutation.
Instead of filtering permutations at all, you should use an algorithm based on the stars and bars idea to generate all possible ways to place len(list_of_items) - 1 separators between increment_divisor outcomes, and convert separator placements to probability dicts.
Thanks to #user2357112 for...
...pointing out the approach to work with ints until the last step.
...directing me to stars and bars approach.
I implemented stars_and_bars as a generator as follows:
def stars_and_bars(n, k, the_list=[]):
    """Distribute n probability tokens among k endings.

    Generator implementation of the stars-and-bars algorithm.

    Arguments:
    n -- number of probability tokens (stars)
    k -- number of endings/bins (bars+1)
    """
    if n == 0:
        yield the_list + [0]*k
    elif k == 1:
        yield the_list + [n]
    else:
        for i in range(n+1):
            yield from stars_and_bars(n-i, k-1, the_list+[i])
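For completeness, a small usage sketch (my own addition) that reproduces the kind of output asked for in the question, dividing by the divisor only at the very end as suggested:
list_of_items = ['a', 'b', 'c']
increment_divisor = 2

h_space = [
    {item: tokens / increment_divisor for item, tokens in zip(list_of_items, combo)}
    for combo in stars_and_bars(increment_divisor, len(list_of_items))
]
print(h_space)
# [{'a': 0.0, 'b': 0.0, 'c': 1.0}, {'a': 0.0, 'b': 0.5, 'c': 0.5},
#  {'a': 0.0, 'b': 1.0, 'c': 0.0}, {'a': 0.5, 'b': 0.0, 'c': 0.5},
#  {'a': 0.5, 'b': 0.5, 'c': 0.0}, {'a': 1.0, 'b': 0.0, 'c': 0.0}]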

Optimal algorithm for the comparison of two dictionaries in Python 3

I have a list of dictionaries like:
Stock=[
{'ID':1,'color':'red','size':'L','material':'cotton','weight':100,'length':300,'location':'China'},
{'ID':2,'color':'green','size':'M','material':'cotton','weight':200,'length':300,'location':'China'},
{'ID':3,'color':'blue','size':'L','material':'cotton','weight':100,'length':300,'location':'China'}
]
And another list of dictionaries like:
Prices=[
    {'color':'red','size':'L','material':'cotton','weight':100,'length':300,'location':'China'},
    {'color':'blue','size':'S','weight':500,'length':150,'location':'USA','cost':'1$'},
    {'color':'pink','size':'L','material':'cotton','location':'China','cost':'5$'},
    {'cost':'5$','color':'blue','size':'L','material':'silk','weight':100,'length':300}
]
So I need to find the 'cost' for each record in Stock from Prices. But there may be a situation when I don't find a 100% coincidence of dict elements, and in that case I need the most similar element and its 'cost'.
output=[{'ID':1,'cost':'1$'},{'ID':2,'cost':'5$'},...]
Please suggest the optimal solution for this task. I think it's something like a loop from highest to lowest compliance, where we try to find the record with maximum coincidence, and if none is found, try a less strict matching condition.
How about this:
Stock=[
{'ID':1,'color':'red','size':'L','material':'cotton','weight':100,'length':300,'location':'China'},
{'ID':2,'color':'green','size':'M','material':'cotton','weight':200,'length':300,'location':'China'},
{'ID':3,'color':'blue','size':'L','material':'cotton','weight':100,'length':300,'location':'China'}
]
Prices=[
{'color':'red','size':'L','material':'cotton','weight':100,'length':300,'location':'China'},
{'cost':'2$','color':'blue','size':'S','weight':500,'length':150,'location':'USA'},
{'cost':'5$','color':'pink','size':'L','material':'cotton','location':'China'},
{'cost':'15$','color':'blue','size':'L','material':'silk','weight':100,'length':300}
]
Prices = [p for p in Prices if "cost" in p]  # make sure that everything has a 'cost'
result = []
for s in Stock:
    field = set(s.items())
    best_match = max(Prices, key=lambda p: len(field.intersection(p.items())), default=None)
    if best_match:
        result.append({"ID": s["ID"], "cost": best_match["cost"]})
print(result)
# [{'ID': 1, 'cost': '5$'}, {'ID': 2, 'cost': '5$'}, {'ID': 3, 'cost': '15$'}]
To find the most similar entry, I first transform each dict into a set and then use max to find the price with the largest intersection with the stock item I'm checking, using a lambda function as the key of max.
This reminds me of fuzzy or neural network solutions.
[on Python 2]
Anyway, here is a NumPy solution:
Stock=[
{'ID':1,'color':'red','size':'L','material':'cotton','weight':100,'length':300,'location':'China'},
{'ID':2,'color':'green','size':'M','material':'cotton','weight':200,'length':300,'location':'China'},
{'ID':3,'color':'blue','size':'L','material':'cotton','weight':100,'length':300,'location':'China'}
]
Prices=[
{'color':'red','size':'L','material':'cotton','weight':100,'length':300,'location':'China'},
{'cost':2,'color':'blue','size':'S','weight':500,'length':150,'location':'USA'},
{'cost':5,'color':'pink','size':'L','material':'cotton','location':'China'},
{'cost':15,'color':'blue','size':'L','material':'silk','weight':100,'length':300}
]
import numpy as np

# remove records that are not useful (no 'cost'); iterate over a copy so removal is safe
for p in list(Prices):
    if not p.has_key('cost'):
        Prices.remove(p)

def numerize(lst_of_dics):
    r = []
    for d in lst_of_dics:
        r1 = []
        for n in ['color', 'size', 'material', 'weight', 'length', 'location']:
            try:
                if n == 'color':
                    # it is 0s if unknown
                    # only 3 letters, should work, bug!!!
                    z = [0, 0, 0]
                    r1 += [ord(d[n][0]), ord(d[n][1]), ord(d[n][2])]
                elif n == 'size':
                    z = [0, 0, 0]
                    r1 += [ord(d[n])]*3
                elif n == 'material':
                    z = [0, 0, 0]
                    r1 += [ord(d[n][0]), ord(d[n][1]), ord(d[n][2])]
                elif n == 'location':
                    z = [0, 0, 0]
                    r1 += [ord(d[n][0]), ord(d[n][1]), ord(d[n][2])]
                else:
                    z = [0, 0, 0]
                    r1 += [d[n]]*3
            except:
                r1 += z
        r.append(r1)
    return r

St = numerize(Stock)
Pr = np.array(numerize(Prices))

output = []
for i, s in enumerate(St):
    s0 = np.reshape(s*Pr.shape[0], Pr.shape)
    # stage 0: tile the single record so we can subtract easily
    s1 = abs(Pr - s0)
    # abs diff
    s2 = s1 * Pr.astype('bool') * s0.astype('bool')
    # non-existent doesn't mean match..
    s21 = np.logical_not(Pr.astype('bool') * s0.astype('bool'))*25
    s2 = s2 + s21
    # ignore the zero fields (non-existent)
    s3 = np.sum(s2, 1)
    # take the smallest
    s4 = np.where(s3 == np.min(s3))[0][0]
    c = Prices[s4]['cost']
    # print c, i
    output.append({'ID': i+1, 'cost': c})
print(output)
That gives me the following results (with many assumptions):
[{'cost': 15, 'ID': 1}, {'cost': 5, 'ID': 2}, {'cost': 15, 'ID': 3}]
Note that this is a correct comparison result based on the values and kinds of the properties.
Please upvote and accept the answer if it satisfies you.
