I have a list like this:
a = [
['a1', 'b2', 'c3'],
['c3', 'd4', 'a1'],
['b2', 'a1', 'e5'],
['d4', 'a1', 'b2'],
['c3', 'b2', 'a1']
]
I'll be given x (e.g. 'a1'). I have to find the co-occurrence of x with every other element, sort the counts, and retrieve the top n (e.g. top 2).
My answer should be:
[
{'product_id': 'b2', 'count': 4},
{'product_id': 'c3', 'count': 3},
]
My current code looks like this:
import itertools

def compute(x):
    set_a = list(set(itertools.chain(*a)))
    count_dict = []
    for i in range(len(set_a)):
        if x == set_a[i]:
            continue
        count = 0
        for j in range(len(a)):
            if x in a[j] and set_a[i] in a[j]:
                count += 1
        if count > 0:
            count_dict.append({'product_id': set_a[i], 'count': count})
    count_dict = sorted(count_dict, key=lambda k: k['count'], reverse=True)[:2]
    return count_dict
It works beautifully for smaller inputs. However, my actual input has 70,000 unique items instead of 5 (a1 to e5) and 1.3 million rows instead of 5, so the m×n scan becomes prohibitively expensive. Is there a faster way to do this?
"Faster" is a very general term. Do you need a shorter total processing time, or shorter response time for a request? Is this for only one request, or do you want a system that handles repeated inputs?
If what you need is the fastest response time for repeated inputs, then convert this entire list of lists into a graph, with each element as a node and the edge weight being the number of co-occurrences between the two elements. You make a single pass over the data to build the graph. For each node, sort the edge list by weight. From there, each request is a simple lookup: return the node's top n edges, which costs one hash lookup and a couple of direct-access operations (base address + offset).
UPDATE after OP's response
"fastest response" seals the algorithm, then. What you want to have is a simple dict, keyed by each node. The value of each node is a sorted list of related elements and their counts.
A graph package (say, networkx) will give you a good entry to this, but may not keep a node's edges in a fast-access form, nor sorted by weight. Instead, pre-process your data set yourself. For each row, you have a list of related elements. Let's just look at the processing for some row in the midst of the data set; call the elements a5, b2, z1, and the dict d. Assume that a5 and b2 are already in your dict.
Using `itertools` (e.g. `itertools.permutations(row, 2)`), iterate through the six ordered pairs:
(a5, b2):
d[a5][b2] += 1
(a5, z1):
d[a5][z1] = 1 (creates a new entry under a5)
(b2, a5):
d[b2][a5] += 1
(b2, z1):
d[b2][z1] = 1 (creates a new entry under b2)
(z1, a5):
d[z1] = {} (creates a new z1 entry in d)
d[z1][a5] = 1 (creates a new entry under z1)
(z1, b2):
d[z1][b2] = 1 (creates a new entry under z1)
You'll want to use defaultdict to save you the hassle of detecting and initializing new entries.
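For illustration, a minimal sketch of that counting pass (my own naming; it assumes the rows are in the list `a` from the question):

from collections import defaultdict
from itertools import permutations

d = defaultdict(lambda: defaultdict(int))      # d[x][y] = co-occurrence count of x and y
for row in a:
    for left, right in permutations(row, 2):   # all ordered pairs within one row
        d[left][right] += 1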
With all of that handled, you now want to sort each of those sub-dicts into order based on the sub-level values. This leaves you with an ordered sequence for each element. When you need to access the top n connected elements, you go straight to the dict and extract them:
top = d[elem][:n]
Can you finish the coding from there?
As mentioned by @Prune, it isn't specified whether you want a shorter processing time or a shorter response time.
So I will explain two approaches to this problem.
The optimised code approach (for less processing time)
import itertools
from heapq import nlargest
from operator import itemgetter

# say we have K THREADS
def compute(x, top_n=2):
    # first find the unique items and save them somewhere easily accessible
    set_a = list(set(itertools.chain(*a)))
    # then find which of your ROWS contain x
    selected_rows = []
    for i, row in enumerate(a):  # this whole loop can be parallelized
        if x in row:
            selected_rows.append(i)  # append the index of the row to selected_rows
    # time complexity so far is still O(M*N), but this part can be run in parallel:
    # each of the M rows can be evaluated independently.
    # With K threads this step takes roughly (M/K)*N.
    count_dict = []
    # now the same thing you did earlier, but the second loop looks at fewer rows
    for val in set_a:
        if val == x:
            continue
        count = 0
        for ri in selected_rows:  # this whole part can be parallelized as well
            if val in a[ri]:
                count += 1
        count_dict.append({'product_id': val, 'count': count})
    # if the selected rows in the worst case number M, and the unique values are U,
    # the complexity of this part is (U/K)*(M/K)*N
    res = nlargest(top_n, count_dict, key=itemgetter('count'))
    return res
Let's calculate the time complexity here.
If we have K threads, then it is
O((M/K)*N) + O((U/K)*(M/K)*N)
where
M ---> total rows
N ---> total columns
U ---> unique values
K ---> number of threads
Graph approach as suggested by Prune
# other approach
# adding to Prune's approach
big_dictionary = {}
set_a = list(set(itertools.chain(*a)))
for x in set_a:
    big_dictionary[x] = []
    for y in set_a:
        count = 0
        if x == y:
            continue
        for arr in a:
            if (x in arr) and (y in arr):
                count += 1
        big_dictionary[x].append((y, count))
for x in big_dictionary:
    big_dictionary[x] = sorted(big_dictionary[x], key=lambda v: v[1], reverse=True)
Let's calculate the time complexity for this one.
The one-time pre-processing complexity is:
O(U*U*M*N)
where
M---> Total rows
N---> Total Columns
U---> Unique Values
But once this big_dictionary has been calculated, it takes just one step to get your top-N values.
For example, if we want the top 3 values for a1:
result=big_dictionary['a1'][:3]
I followed the defaultdict approach suggested by @Prune above. Here's the final code:
from collections import defaultdict
import numpy as np

def recommender(input_item, b_list, n):
    count = []
    top_items = []
    for x in b.keys():
        lst_2 = b[x]
        common_transactions = len(set(b_list) & set(lst_2))
        count.append(common_transactions)
    top_ids = list((np.argsort(count)[:-n-2:-1])[1::])
    top_values_counts = [count[i] for i in top_ids]
    key_list = list(b.keys())
    for i, v in enumerate(top_ids):
        item_id = key_list[v]
        top_items.append({item_id: top_values_counts[i]})
    print(top_items)
    return top_items

a = [
    ['a1', 'b2', 'c3'],
    ['c3', 'd4', 'a1'],
    ['b2', 'a1', 'e5'],
    ['d4', 'a1', 'b2'],
    ['c3', 'b2', 'a1']
]

b = defaultdict(list)
for i, s in enumerate(a):
    for key in s:
        b[key].append(i)

input_item = str(input("Enter the item_id: "))
n = int(input("How many values to be retrieved? (eg: top 5, top 2, etc.): "))
top_items = recommender(input_item, b[input_item], n)
Here's the output for top 3 for 'a1':
[{'b2': 4}, {'c3': 3}, {'d4': 2}]
Thanks!!!
I have a list of keys:
['A', 'B', 'C']
For each of those keys there's a list of properties:
{
'A': [2,3],
'B': [1,2],
'C': [4]
}
I wish to sort the list of labels such that neighboring labels share as many of the properties as possible.
In the above example A and B share the relation 2, so they should be next to each other - whereas C shares nothing with them, so it can go anywhere.
So the possible orders for this example would be as follows:
["A","B","C"] # acceptable
["A","C","B"] # NOT acceptable
["B","A","C"] # acceptable
["B","C","A"] # NOT acceptable
["C","A","B"] # acceptable
["C","B","A"] # acceptable
Buckets
Actually I would prefer this to be represented by putting them into "buckets":
[["A", "B"], ["C"]] # this can represent all four possible orders above.
However, this gets problematic if a label belongs to two different buckets:
{
'A': [2,3],
'B': [1,2],
'C': [1,4]
}
How would I represent that?
I could put it like this:
[["A", "B"], ["C", "B"]]
But then I need another processing step to turn the list of buckets
into the final representation:
["A", "B", "C"]
And above that there could be recursively nested buckets:
[[["A","B"], ["C"]], ["D"]]
And then these could overlap:
[[["A","B"], ["C"]], ["A","D"]]
Quality
The "closeness", i.e. quality of a solution is defined as the sum of the intersection of relations between neighbors (the higher the quality the better):
def measurequality(result, mapping):
    lastKey = None
    quality = 0
    for key in result:
        if lastKey is None:
            lastKey = key
            continue
        quality += len(set(mapping[key]).intersection(mapping[lastKey]))
        lastKey = key
    return quality

# Example determining that the solution ['A', 'B', 'C'] has quality 1:
# measurequality(['A', 'B', 'C'],
#                {
#                    'A': [2,3],
#                    'B': [1,2],
#                    'C': [4]
#                })
Brute-Forcing
Brute-forcing does not constitute a solution (in practice the list contains on the order of several thousand elements - though, if anyone got a brute-forcing approach that is better than O(n²)...).
However, using brute-forcing to create additional test cases is possible:
produce a list L of n items ['A','B','C',...]
produce for each item a dictionary R of relations (up to n random numbers between 0 and n should be sufficient).
produce all possible permutations of L, feed them together with R into measurequality(), and keep those with the maximal return value (might not be unique) - a brute-force sketch of this follows below.
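A brute-force reference solver along those lines might look like the following sketch (my own code, only practical for small n; it reuses measurequality() from above):

import itertools

def bruteforce(keys, mapping):
    # try every permutation and keep the orderings with the maximal quality
    best_quality = -1
    best_orders = []
    for perm in itertools.permutations(keys):
        q = measurequality(list(perm), mapping)
        if q > best_quality:
            best_quality, best_orders = q, [list(perm)]
        elif q == best_quality:
            best_orders.append(list(perm))
    return best_quality, best_orders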
Code for creating random testcases to test the implementation:
import string
import random

def randomtestcase(n):
    keys = list(string.ascii_uppercase[0:n])
    minq = 0
    maxq = 0
    while minq == maxq:
        items = {}
        for key in keys:
            items[key] = random.sample(range(1, 10), int(random.random()*10))
        minq = n*n
        minl = list(keys)
        maxq = 0
        maxl = list(keys)
        for _ in range(0, 1000):  # TODO: explicitly construct all possible permutations of keys.
            random.shuffle(keys)
            q = measurequality(keys, items)
            if q < minq:
                minq = q
                minl = list(keys)
            if maxq < q:
                maxq = q
                maxl = list(keys)
    return (items, minl, maxq)

(items, keys, quality) = randomtestcase(5)

sortedkeys = dosomething(keys, items)
actualquality = measurequality(sortedkeys, items)
if actualquality < quality:
    print('Suboptimal: quality {0} < {1}'.format(actualquality, quality))
Attempt
One of the many "solutions" that didn't work (very broken, this one doesn't have the selection of initial element / choice between prepending and appending to the result list that I had in others):
def dosomething(keys, items):
    result = []
    todo = list(keys)
    result.append(todo.pop())
    while any(todo):
        lastItems = set(items[result[-1]])
        bestScore = None
        bestKey = None
        for key in todo:
            score = set(items[key]).intersection(lastItems)
            if bestScore is None or bestScore < score:
                bestScore = score
                bestKey = key
        todo.remove(bestKey)
        result.append(bestKey)
    return result
Examples
(Also check out the example generator in the section Brute-Forcing above.)
Testing code trying some examples:
def test(description, acceptable, keys, arguments):
    actual = dosomething(keys, arguments)
    if "".join(actual) in acceptable:
        return 0
    print("\n[{0}] {1}".format("".join(keys), description))
    print("Expected: {0}\nBut was: {1}".format(acceptable, actual))
    print("Quality of result: {0}".format(measurequality(actual, arguments)))
    print("Quality of expected: {0}".format([measurequality(a, arguments) for a in acceptable]))
    return 1
print("EXAMPLES")
failures = 0
# Need to try each possible ordering of letters to ensure that the order of keys
# wasn't accidentially already a valid ordering.
for keys in [
["A","B","C"],
["A","C","B"],
["B","A","C"],
["B","C","A"],
["C","A","B"],
["C","B","A"]
]:
failures += test(
"1. A and B both have 2, C doesn't, so C can go first or last but not in between.",
["ABC", "BAC", "CAB", "CBA"],
keys,
{
"A": [2,3],
"B": [1,2],
"C": [4]
})
failures += test(
"2. They all have 2, so they can show up in any order.",
["ABC", "ACB", "BAC", "BCA", "CAB", "CBA"],
keys,
{
"A": [2,3],
"B": [1,2],
"C": [2]
})
failures += test(
"3. A and B share 2, B and C share 1, so B must be in the middle.",
["ABC", "CBA"],
keys,
{
"A": [2,3],
"B": [1,2],
"C": [1]
})
failures += test(
"4. Each shares something with each other, creating a cycle, so they can show up in any order.",
["ABC", "ACB", "BAC", "BCA", "CAB", "CBA"],
keys,
{
"A": [2,3],
"B": [1,2],
"C": [1,3]
})
if 0 < failures:
print("{0} FAILURES".format(failures))
Precedence
As was asked: the numbers used for the relations aren't in an order of precedence. An order of precedence exists, but it's a partial order and not the one given by the numbers. I just didn't mention it because it makes the problem harder.
So given this example:
{
'A': [2,3],
'B': [1,2],
'C': [4]
}
it might be replaced by the following (using letters instead of numbers and adding precedence information):
{
'A': [('Q',7),('R',5)],
'B': [('P',6),('Q',6)],
'C': [('S',5)]
}
Note that
The precedence is meaningful only within a list, not across lists.
The precedence of shared relations might be different between two lists.
Within a list there might be the same precedence several times.
This is a Travelling Salesman Problem, a notoriously hard problem to solve optimally. The code presented solves for 10,000 nodes with simple interconnections (i.e. one or two relations each) in about 15 minutes. It performs less well for sequences that are more richly interconnected. This is explored in the test results below.
The idea of precedence, mentioned by the OP, is not explored.
The presented code consists of a heuristic solution, a brute-force solution that is optimal but not practical for large node_sets, and some simple but scalable test data generators, some with known optimal solutions. The heuristic finds optimal solutions for the OP's 'ABC' example, my own 8-item node_set, and for scalable test data for which optimal solutions are known.
If the performance is not good enough, at least it is a bit of a first effort and has the beginnings of a test 'workshop' to improve the heuristic.
Test Results
>>> python sortbylinks.py
===============================
"ABC" (sequence length 3)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 3
[('A', 'B', 'C'), ('C', 'A', 'B'), ('B', 'A', 'C')], ...and more
Top score: 1
"optimise_order" function took 0.0s
Optimal quality: 1
===============================
"Nick 8-digit" (sequence length 8)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('A', 'E', 'F', 'C', 'D', 'G', 'B', 'H')], ...and more
Top score: 6
"optimise_order" function took 0.0s
Optimal quality: 6
Short, relatively trivial cases appear to be no problem.
===============================
"Quality <1000, cycling subsequence, small number of relations" (sequence length 501)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 3
[('AAAC', 'AAAL', 'AAAU', ...), ...], ...and more
Top score: 991
"optimise_order" function took 2.0s
Optimal quality: 991
===============================
"Quality <1000, cycling subsequence, small number of relations, shuffled" (sequence length 501)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 3
[('AADF', 'AAKM', 'AAOZ', ...), ...], ...and more
Top score: 991
"optimise_order" function took 2.0s
Optimal quality: 991
The "Quality <1000, cycling subsequence" (sequence length 501) is interesting. By grouping nodes with {0, 1} relation sets the quality score can be nearly doubled. The heuristic finds this optimal sequence. Quality 1000 is not quite possible because these double-linked groups need attaching to each other via a single-linked node every so often (e.g. {'AA': {0, 1}, 'AB': {0, 1}, ..., 'AZ': {0, 1}, <...single link here...> 'BA': {1, 2}, 'BB': {1, 2}, ...}).
Performance is still good for this test data with few relations per node.
"Quality 400, single unbroken chain, initial solution is optimal" (sequence length 401)
===============================
Working...
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAAA', 'AAAB', 'AAAC', ...), ...], ...and more
Top score: 400
"optimise_order" function took 0.0s
Optimal quality: 400
===============================
"Quality 400, single unbroken chain, shuffled" (sequence length 401)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAAA', 'AAAB', 'AAAC', ...), ...], ...and more
Top score: 400
"optimise_order" function took 0.0s
Optimal quality: 400
One of the difficulties with Travelling Salesman Problems (TSPs) is knowing when you have an optimal solution. The heuristic doesn't seem to converge any faster even from a near-optimal or optimal start.
===============================
"10,000 items, single unbroken chain, initial order is optimal" (sequence length 10001)
===============================
Working...
Finding heuristic candidates...
Number of candidates with top score: 1
[('AOUQ', 'AAAB', 'AAAC', ...), ...], ...and more
Top score: 10002
"optimise_order" function took 947.0s
Optimal quality: 10000
When there are very small numbers of relations, even if there are many nodes, performance is pretty good and the results may be close to optimal.
===============================
"Many random relations per node (0..n), n=200" (sequence length 200)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAEO', 'AAHC', 'AAHQ', ...), ...], ...and more
Top score: 6861
"optimise_order" function took 94.0s
Optimal quality: ?
===============================
"Many random relations per node (0..n), n=500" (sequence length 500)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAJT', 'AAHU', 'AABT', ...), ...], ...and more
Top score: 41999
"optimise_order" function took 4202.0s
Optimal quality: ?
This is more like the data generated by the OP, and also more like the classical Travelling Salesman Problem (TSP) where you have a set of distances between each city pair (for 'city' read 'node') and nodes are typically richly connected to each other. In this case the links between nodes are partial-- there is no guarantee of a link between any 2 nodes.
The heuristic's time performance is much worse in cases like this. There are between 0 and n random relations for each node, for n nodes. This is likely to mean many more swap combinations yield improved quality, swaps and quality checks are more expensive in themselves and many more passes will be needed before the heuristic converges on its best result. This may mean O(n^3) in the worst case.
Performance degrades as the number of nodes and relations increases (note the difference between n=200-- 3 minutes-- and n=500-- 70 minutes.) So currently the heuristic may not be practical for several thousand richly-interconnected nodes.
In addition, the quality of the result for this test can't be known precisely because a brute-force solution is not computationally feasible. 6861 / 200 = 34.3 and 41999 / 500 = 84.0 average connections between node pairs doesn't look too far off.
Code for the Heuristic and Brute Force Solvers
import sys
from collections import deque
import itertools
import random
import datetime

# TODO: in-place swapping? (avoid creating copies of sequences)

def timing(f):
    """
    decorator for displaying execution time for a method
    """
    def wrap(*args):
        start = datetime.datetime.now()
        f_return_value = f(*args)
        end = datetime.datetime.now()
        print('"{:s}" function took {:.1f}s'.format(f.__name__, (end - start).seconds))
        return f_return_value
    return wrap

def flatten(a):
    # e.g. [a, [b, c], d] -> [a, b, c, d]
    return itertools.chain.from_iterable(a)
class LinkAnalysis:
    def __init__(self, node_set, max_ram=100_000_000, generate_seeds=True):
        """
        :param node_set: node_ids and their relation sets to be arranged in sequence
        :param max_ram: estimated maximum RAM to use
        :param generate_seeds: if true, attempt to generate some initial candidates based on sorting
        """
        self.node_set = node_set
        self.candidates = {}
        self.generate_seeds = generate_seeds
        self.seeds = {}
        self.relations = []
        # balance performance and RAM using regular 'weeding'
        candidate_size = sys.getsizeof(deque(self.node_set.keys()))
        self.weed_interval = max_ram // candidate_size

    def create_initial_candidates(self):
        print('Working...')
        self.generate_seed_from_presented_sequence()
        if self.generate_seeds:
            self.generate_seed_candidates()

    def generate_seed_from_presented_sequence(self):
        """
        add initially presented order of nodes as one of the seed candidates
        this is worthwhile because the initial order may be close to optimal
        """
        presented_sequence = self.presented_sequence()
        self.seeds[tuple(presented_sequence)] = self.quality(presented_sequence)

    def presented_sequence(self) -> list:
        return list(self.node_set.keys())  # relies on Python 3.6+ to preserve key order in dicts

    def generate_seed_candidates(self):
        initial_sequence = self.presented_sequence()
        # get relations from the original data structure
        relations = sorted(set(flatten(self.node_set.values())))
        # sort by lowest precedence relation first
        print('...choosing seed candidates')
        for relation in reversed(relations):
            # use true-false ordering: in Python, True > False
            initial_sequence.sort(key=lambda sortkey: relation not in self.node_set[sortkey])
            sq = self.quality(initial_sequence)
            self.seeds[tuple(initial_sequence)] = sq
    def quality(self, sequence):
        """
        calculate quality of full sequence
        :param sequence:
        :return: quality score (int)
        """
        pairs = zip(sequence[:-1], sequence[1:])
        scores = [len(self.node_set[a].intersection(self.node_set[b]))
                  for a, b in pairs]
        return sum(scores)

    def brute_force_candidates(self, sequence):
        for sequence in itertools.permutations(sequence):
            yield sequence, self.quality(sequence)

    def heuristic_candidates(self, seed_sequence):
        # look for solutions with higher quality scores by swapping elements
        # start with large distances between elements
        # then reduce by power of 2 until swapping next-door neighbours
        max_distance = len(seed_sequence) // 2
        max_pow2 = int(pow(max_distance, 1/2))
        distances = [int(pow(2, r)) for r in reversed(range(max_pow2 + 1))]
        for distance in distances:
            yield from self.seed_and_variations(seed_sequence, distance)
        # seed candidate may be better than its derived sequences -- include it as a candidate
        yield seed_sequence, self.quality(seed_sequence)

    def seed_and_variations(self, seed_sequence, distance=1):
        # swap elements at a distance, starting from beginning and end of the
        # sequence in seed_sequence
        candidate_count = 0
        for pos1 in range(len(seed_sequence) - distance):
            pos2 = pos1 + distance
            q = self.quality(seed_sequence)
            # from beginning of sequence
            yield self.swap_and_quality(seed_sequence, q, pos1, pos2)
            # from end of sequence
            yield self.swap_and_quality(seed_sequence, q, -pos1, -pos2)
            candidate_count += 2
            if candidate_count > self.weed_interval:
                self.weed()
                candidate_count = 0
    def swap_and_quality(self, sequence, preswap_sequence_q: int, pos1: int, pos2: int) -> (tuple, int):
        """
        swap and return quality (which can easily be calculated from the present quality)
        :param sequence: as for swap
        :param pos1: as for swap
        :param pos2: as for swap
        :param preswap_sequence_q: quality of pre-swapped sequence
        :return: swapped sequence, quality of swapped sequence
        """
        initial_node_q = sum(self.node_quality(sequence, pos) for pos in [pos1, pos2])
        swapped_sequence = self.swap(sequence, pos1, pos2)
        swapped_node_q = sum(self.node_quality(swapped_sequence, pos) for pos in [pos1, pos2])
        qdelta = swapped_node_q - initial_node_q
        swapped_sequence_q = preswap_sequence_q + qdelta
        return swapped_sequence, swapped_sequence_q

    def swap(self, sequence, node_pos1: int, node_pos2: int):
        """
        deques perform better than lists for swapping elements in a long sequence
        :param sequence: sequence on which to perform the element swap
        :param node_pos1: zero-based position of first element
        :param node_pos2: zero-based position of second element
        >>> swap(('A', 'B', 'C'), 0, 1)
        ('B', 'A', 'C')
        """
        if type(sequence) is tuple:
            # sequence is a candidate (candidates are dict keys and hence tuples)
            # needs converting to a list for swap processing
            sequence = list(sequence)
        if node_pos1 == node_pos2:
            return sequence
        tmp = sequence[node_pos1]
        sequence[node_pos1] = sequence[node_pos2]
        sequence[node_pos2] = tmp
        return sequence

    def node_quality(self, sequence, pos):
        if pos < 0:
            pos = len(sequence) + pos
        no_of_links = 0
        middle_node_relations = self.node_set[sequence[pos]]
        if pos > 0:
            left_node_relations = self.node_set[sequence[pos - 1]]
            no_of_links += len(left_node_relations.intersection(middle_node_relations))
        if pos < len(sequence) - 1:
            right_node_relations = self.node_set[sequence[pos + 1]]
            no_of_links += len(middle_node_relations.intersection(right_node_relations))
        return no_of_links
    @timing
    def optimise_order(self, selection_strategy):
        top_score = 0
        new_top_score = True
        self.candidates.update(self.seeds)
        while new_top_score:
            top_score = max(self.candidates.values())
            new_top_score = False
            initial_candidates = {name for name, score in self.candidates.items() if score == top_score}
            for initial_candidate in initial_candidates:
                for candidate, q in selection_strategy(initial_candidate):
                    if q > top_score:
                        new_top_score = True
                        top_score = q
                        self.candidates[tuple(candidate)] = q
            self.weed()
        print(f"Number of candidates with top score: {len(list(self.candidates))}")
        print(f"{list(self.candidates)[:3]}, ...and more")
        print(f"Top score: {top_score}")

    def weed(self):
        # retain only top-scoring candidates
        top_score = max(self.candidates.values())
        low_scorers = {k for k, v in self.candidates.items() if v < top_score}
        for low_scorer in low_scorers:
            del self.candidates[low_scorer]
Code Glossary
node_set: a set of labelled nodes of the form 'unique_node_id': {relation1, relation2, ..., relationN}. The set of relations for each node can contain either no relations or an arbitrary number.
node: a key-value pair consisting of a node_id (key) and set of relations (value)
relation: as used by the OP, this is a number. If two nodes both share relation 1 and they are neighbours in the sequence, it adds 1 to the quality of the sequence.
sequence: an ordered list of node ids (e.g. ['A', 'B', 'C']) that is associated with a quality score. The quality score is the sum of shared relations between neighbouring nodes in the sequence. The output of the heuristic is the sequence or sequences with the highest quality score.
candidate: a sequence that is currently being investigated to see if it is of high quality.
Method
1. Generate seed sequences by stable sorting on the presence or absence of each relation in a linked item.
2. The initially presented order is also one of the seed sequences, in case it is close to optimal.
3. For each seed sequence, pairs of nodes are swapped looking for a higher quality score.
4. Execute a "round" for each seed sequence. A round is a shellsort-like pass over the sequence, swapping pairs of nodes, at first at a distance, then narrowing the distance until there is a distance of 1 (swapping immediate neighbours). Keep only those sequences whose quality is more than the current top quality score.
5. If a new highest quality score was found in this round, weed out all but the top-score candidates and repeat step 4 using the top scorers as seeds. Otherwise exit.
Tests and Test Data Generators
The heuristic has been tested using small node_sets, scaled data of a few hundred up to 10,000 nodes with very simple relationships, and a randomised, richly interconnected node_set more like the OP's test data generator. Perfect single-linked sequences, linked cycles (small subsequences that link within themselves, and to each other) and shuffling have been useful to pick up and fix weaknesses.
ABC_links = {
    'A': {2, 3},
    'B': {1, 2},
    'C': {4}
}

nick_links = {
    'B': {1, 2, 4},
    'C': {4},
    'A': {2, 3},
    'D': {4},
    'E': {3},
    'F': {5, 6},
    'G': {2, 4},
    'H': {1},
}

unbroken_chain_linked_tail_to_head = ({1, 3}, {3, 4}, {4, 5}, {5, 6}, {6, 7}, {7, 8}, {8, 9}, {9, 10}, {10, 1})
cycling_unbroken_chain_linked_tail_to_head = itertools.cycle(unbroken_chain_linked_tail_to_head)

def many_nodes_many_relations(node_count):
    # data set with n nodes and random 0..n relations as per OP's requirement
    relation_range = list(range(node_count))
    relation_set = (
        set(random.choices(relation_range, k=random.randint(0, node_count)))
        for _ in range(sys.maxsize)
    )
    return scaled_data(node_count, relation_set)

def scaled_data(no_of_items, link_sequence_model):
    uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    # unique labels using sequence of four letters (AAAA, AAAB, AAAC, .., AABA, AABB, ...)
    item_names = (''.join(letters) for letters in itertools.product(*([uppercase] * 4)))
    # only use a copy of the original link sequence model-- otherwise the model could be exhausted
    # or start mid-cycle
    # https://stackoverflow.com/questions/42132731/how-to-create-a-copy-of-python-iterator
    link_sequence_model, link_sequence = itertools.tee(link_sequence_model)
    return {item_name: links for _, item_name, links in zip(range(no_of_items), item_names, link_sequence)}

def shuffled(items_list):
    """relies on Python 3.6+ dictionary insertion-ordered keys"""
    shuffled_keys = list(items_list.keys())
    random.shuffle(shuffled_keys)
    return {k: items_list[k] for k in shuffled_keys}

cycling_quality_1000 = scaled_data(501, cycling_unbroken_chain_linked_tail_to_head)
cycling_quality_1000_shuffled = shuffled(cycling_quality_1000)

linked_forward_sequence = ({n, n + 1} for n in range(sys.maxsize))
# { 'A': {0, 1}, 'B': {1, 2}, ... } links A to B to ...
optimal_single_linked_unbroken_chain = scaled_data(401, linked_forward_sequence)
shuffled_single_linked_unbroken_chain = shuffled(optimal_single_linked_unbroken_chain)

large_node_set = scaled_data(10001, cycling_unbroken_chain_linked_tail_to_head)
large_node_set_shuffled = shuffled(large_node_set)

tests = [
    ('ABC', 1, ABC_links, True),
    ('Nick 8-digit', 6, nick_links, True),
    # ('Quality <1000, cycling subsequence, small number of relations', 1000 - len(unbroken_chain_linked_tail_to_head), cycling_quality_1000, True),
    # ('Quality <1000, cycling subsequence, small number of relations, shuffled', 1000 - len(unbroken_chain_linked_tail_to_head), cycling_quality_1000_shuffled, True),
    ('Many random relations per node (0..n), n=200', '?', many_nodes_many_relations(200), True),
    # ('Quality 400, single unbroken chain, initial solution is optimal', 400, optimal_single_linked_unbroken_chain, False),
    # ('Quality 400, single unbroken chain, shuffled', 400, shuffled_single_linked_unbroken_chain, True),
    # ('10,000 items, single unbroken chain, initial order is optimal', 10000, large_node_set, False),
    # ('10,000 items, single unbroken chain, shuffled', 10000, large_node_set_shuffled, True),
]

for title, expected_quality, item_links, generate_seeds in tests:
    la = LinkAnalysis(node_set=item_links, generate_seeds=generate_seeds)
    seq_length = len(list(la.node_set.keys()))
    print()
    print('===============================')
    print(f'"{title}" (sequence length {seq_length})')
    print('===============================')
    la.create_initial_candidates()
    print('Finding heuristic candidates...')
    la.optimise_order(la.heuristic_candidates)
    print(f'Optimal quality: {expected_quality}')
    # print('Brute Force working...')
    # la.optimise_order(la.brute_force_candidates)
Performance
The heuristic is more 'practical' than the brute force solution because it leaves out many possible combinations. It may be that a lower-quality sequence produced by an element swap is actually one step away from a much higher-quality score after one more swap, but such a case might be weeded out before it could be tested.
The heuristic appears to find optimal results for single-linked sequences or cyclical sequences linked head to tail. These have a known optimal solution, and the heuristic finds it, although they may be less complex and problematic than real data.
A big improvement came with the introduction of an "incremental" quality calculation which can quickly calculate the quality difference a two-element swap makes without recomputing the quality score for the entire sequence.
I was tinkering with your test program and came up with this solution, which gives me 0 failures. It feels like a heuristic though; it definitely needs more testing and more test cases. The function assumes that the keys are unique (so no ['A', 'A', 'B', ...] lists) and that all elements are present in the arguments dictionary:
def dosomething(_, arguments):
    m = {}
    for k, v in arguments.items():
        for i in v:
            m.setdefault(i, []).append(k)
    out, seen = [], set()
    for _, values in sorted(m.items(), key=lambda k: -len(k[1])):
        for v in values:
            if v not in seen:
                out.append(v)
                seen.add(v)
    return out
EDIT: Misread Quality function, this maps to separable traveling salesman problems
For N nodes, P properties, and T total properties across all nodes, this should be able to be solved in O(N + P + T) or better, depending on the topology of the data.
Let's convert your problem to a graph, where the "distance" between any two nodes is -(number of shared properties). Nodes with no connections would be left unlinked. This will take at least O(T) to create the graph, and perhaps another O(N + P) to segment.
Your "sort order" is then translated to a "path" through the nodes. In particular, you want the shortest path.
Additionally, you will be able to apply several translations to improve the performance and usability of generic algorithms:
Segment the graph into disconnected chunks and solve each of them independently (a rough sketch follows below)
Renormalize all the values to start at 1..N instead of -N..-1 (per graph, but doesn't really matter, could add |number of properties| instead).
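As an illustration only (my own sketch, not part of the references below): building the weighted graph from the property lists and segmenting it into connected components could look roughly like this:

from collections import defaultdict
from itertools import combinations

def build_edge_weights(mapping):
    # mapping: {'A': [2, 3], 'B': [1, 2], ...}
    # invert: property -> nodes carrying that property
    by_property = defaultdict(list)
    for node, props in mapping.items():
        for p in props:
            by_property[p].append(node)
    # edge weight = number of shared properties (negate it to get a "distance")
    weights = defaultdict(int)
    for nodes in by_property.values():
        for u, v in combinations(sorted(nodes), 2):
            weights[(u, v)] += 1
    return weights

def connected_chunks(mapping, weights):
    # segment the graph into disconnected chunks that can be solved independently
    adjacency = defaultdict(set)
    for u, v in weights:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, chunks = set(), []
    for start in mapping:
        if start in seen:
            continue
        stack, chunk = [start], set()
        while stack:
            node = stack.pop()
            if node in chunk:
                continue
            chunk.add(node)
            stack.extend(adjacency[node] - chunk)
        seen |= chunk
        chunks.append(chunk)
    return chunks

For the first example this yields the single edge ('A', 'B') with weight 1 and the chunks [{'A', 'B'}, {'C'}], after which each chunk can be ordered independently.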
https://en.wikipedia.org/wiki/Component_(graph_theory)#Algorithms
It is straightforward to compute the components of a graph in linear time (in terms of the numbers of the vertices and edges of the graph).
https://en.wikipedia.org/wiki/Shortest_path_problem#Undirected_graphs
Weights Time complexity Author
ℝ+ O(V²) Dijkstra 1959
ℝ+ O((E + V) log V) Johnson 1977 (binary heap)
ℝ+ O(E + V log V) Fredman & Tarjan 1984 (Fibonacci heap)
ℕ O(E) Thorup 1999 (requires constant-time multiplication).
I have a Python dictionary where keys represent some item and values represent some (normalized) weighting for said item. For example:
d = {'a': 0.0625, 'c': 0.625, 'b': 0.3125}
# Note that sum(d.values()) == 1 for all `d`
Given this correlation of items to weights, how can I choose a key from d such that 6.25% of the time the result is 'a', 31.25% of the time the result is 'b', and 62.5% of the time the result is 'c'?
import random

def weighted_random_by_dct(dct):
    rand_val = random.random()
    total = 0
    for k, v in dct.items():
        total += v
        if rand_val <= total:
            return k
    assert False, 'unreachable'
This should do the trick: it goes through each key keeping a running sum, and when the random value (between 0 and 1) falls into that key's slot, it returns the key.
Starting in Python 3.6, you can use the built-in random.choices() instead of having to use Numpy.
So then, if we want to sample (with replacement) 25 keys from your dictionary where the values are the weights/probabilities of being sampled, we can simply write:
import random
random.choices(list(my_dict.keys()), weights=my_dict.values(), k=25)
This outputs a list of sampled keys:
['c', 'b', 'c', 'b', 'b', 'c', 'c', 'c', 'b', 'c', 'b', 'c', 'b', 'c', 'c', 'c', 'c', 'c', 'a', 'b']
If you just want one key, set k to 1 and extract the single element from the list that random.choices returns:
random.choices(list(my_dict.keys()), weights=my_dict.values(), k=1)[0]
(If you don't convert my_dict.keys() to a list, you'll get a TypeError about how it's not subscriptable.)
Here's the relevant snippet from the docs:
random.choices(population, weights=None, *, cum_weights=None, k=1)
Return a k sized list of elements chosen from the population with replacement. If the population is empty, raises IndexError.
If a weights sequence is specified, selections are made according to the relative weights. Alternatively, if a cum_weights sequence is given, the selections are made according to the cumulative weights (perhaps computed using itertools.accumulate()). For example, the relative weights [10, 5, 30, 5] are equivalent to the cumulative weights [10, 15, 45, 50]. Internally, the relative weights are converted to cumulative weights before making selections, so supplying the cumulative weights saves work.
If neither weights nor cum_weights are specified, selections are made with equal probability. If a weights sequence is supplied, it must be the same length as the population sequence. It is a TypeError to specify both weights and cum_weights.
The weights or cum_weights can use any numeric type that interoperates with the float values returned by random() (that includes integers, floats, and fractions but excludes decimals). Weights are assumed to be non-negative.
For a given seed, the choices() function with equal weighting typically produces a different sequence than repeated calls to choice(). The algorithm used by choices() uses floating point arithmetic for internal consistency and speed. The algorithm used by choice() defaults to integer arithmetic with repeated selections to avoid small biases from round-off error.
According to the comments at https://stackoverflow.com/a/39976962/5139284, random.choices is faster for small arrays, and numpy.random.choice is faster for big arrays. numpy.random.choice also provides an option to sample without replacement, whereas there's no built-in Python standard library function for that.
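For instance, a sketch of weighted sampling without replacement (my own example, using the dictionary from the question):

import numpy as np

d = {'a': 0.0625, 'c': 0.625, 'b': 0.3125}
# two distinct keys, drawn without replacement, respecting the weights
sample = np.random.choice(list(d), size=2, replace=False, p=list(d.values()))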
If you're planning to do this a lot, you could use numpy to select your keys from a list with weighted probabilities using np.random.choice(). The below example will pick your keys 10,000 times with the weighted probabilities.
import numpy as np
probs = [0.0625, 0.625, 0.3125]
keys = ['a', 'c', 'b']
choice_list = np.random.choice(keys, 10000, replace=True, p=probs)
Not sure what your use case is here, but you can check out the frequency distribution/probability distribution classes in the NLTK package, which handle all the nitty-gritty details.
FreqDist is an extension of a counter, which can be passed to a ProbDistI interface. The ProbDistI interface exposes a "generate()" method which can be used to sample the distribution, as well as a "prob(sample)" method that can be used to get the probability of a given key.
For your case you'd want to use Maximum Likelihood Estimation, so the MLEProbDist. If you want to smooth the distribution, you could try LaplaceProbDist or SimpleGoodTuringProbDist.
For example:
from nltk.probability import FreqDist, MLEProbDist

d = {'a': 6.25, 'c': 62.5, 'b': 31.25}
freq_dist = FreqDist(d)
prob_dist = MLEProbDist(freq_dist)

print(prob_dist.prob('a'))
print(prob_dist.prob('b'))
print(prob_dist.prob('c'))
print(prob_dist.prob('d'))
will print 0.0625, 0.3125, 0.625 and 0.0.
To generate a new sample, you can use:
prob_dist.generate()
If you are able to use numpy, you can use the numpy.random.choice function, like so:
import numpy as np

d = {'a': 0.0625, 'c': 0.625, 'b': 0.3125}

def pick_by_weight(d):
    d_choices = []
    d_probs = []
    for k, v in d.items():
        d_choices.append(k)
        d_probs.append(v)
    return np.random.choice(d_choices, 1, p=d_probs)[0]

choice = pick_by_weight(d)
What I understand: you need a simple random function that generates a number uniformly between 0 and 1. If the value is between 0 and 0.0625 you select key 'a'; if it is between 0.0625 and (0.0625 + 0.625) you select key 'c'; and so on. This is what is actually described in this answer.
Since the random numbers are generated uniformly, keys associated with larger weights are expected to be selected more often than the others.
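A sketch of that cumulative-interval idea using only the standard library (the function name is mine):

import random
from bisect import bisect
from itertools import accumulate

def weighted_key(d):
    keys = list(d)
    cum_weights = list(accumulate(d.values()))  # e.g. [0.0625, 0.6875, 1.0]
    r = random.random() * cum_weights[-1]       # uniform in [0, total weight)
    return keys[bisect(cum_weights, r)]         # first interval whose upper bound exceeds r

print(weighted_key({'a': 0.0625, 'c': 0.625, 'b': 0.3125}))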