I have a weighted graph with around 800 nodes, each with a number of connections ranging from 1 to around 300. I need to find the shortest (lowest cost) path between two nodes with some extra criteria:
The path must contain exactly five nodes.
Each node has an attribute (called position in the example code) that takes one of five values; the five nodes in the path must all have unique values for this attribute.
The algorithm needs to allow for 1-2 required nodes to be specified that the path must contain at some point in any order.
The algorithm needs to take less than 10 seconds to run, preferably as little time as possible while losing as little accuracy as possible.
My current solution in Python is to run a Depth-Limited Depth-First Search which recursively searches every possible path. To make this algorithm run in reasonable time I have introduced a limit to the number of neighbour nodes that are searched at each recursion level. This number can be lowered to decrease the computation time but at the cost of accuracy. Currently this algorithm is far too slow, with my most recent test coming in at 75 seconds with a neighbour limit of 30. If I decrease this neighbour limit any more, my testing shows that the accuracy of the algorithm begins to suffer badly. I am out of ideas on how to solve this problem while satisfying all of the above criteria. My code is as follows:
# The path must go from start -> end, be of length 5 and contain all nodes in middle
# Each node is represented as a tuple: (value, position)
def dfs(start, end, middle=[], path=Path(), best=Path([], math.inf)):
    # If this is the first level of recursion, initialise the path variable
    if len(path) == 0:
        path = Path([start])
    # If the max depth has been exceeded, check if the current node is the goal node
    if len(path) >= depth:
        # If it is, save the path
        # Check that all required nodes have been visited
        if len(path) == depth and start == end and path.cost < best.cost and all(x in path.path for x in middle):
            # Store the new best path
            best.path = path.path
            best.cost = path.cost
        return
    # Run DFS on all of the neighbors of the node that haven't been searched already
    # Use the weights of the neighbors as a heuristic; sort by lowest weight first
    neighbors = sorted([x for x in graph.get(*start).connected_nodes], key=lambda x: graph.weight(start, x))
    # Make sure that all neighbors haven't been visited yet and that their positions aren't already accounted for
    positions = [x[1] for x in path.path]
    # Only visit neighbouring nodes with unique positions and ids
    filtered = [x for x in neighbors if x not in path.path and x[1] not in positions]
    for neighbor in filtered[:neighbor_limit]:
        if neighbor not in path.path:
            dfs(neighbor, end, middle, Path(path.path + [neighbor], path.cost + graph.weight(start, neighbor)), best)
    return best
Path Class:
class Path:
    def __init__(self, path=[], cost=0):
        self.path = path
        self.cost = cost

    def __len__(self):
        return len(self.path)
Any help in improving this algorithm or even suggestions on a better approach to the problem would be much appreciated, thanks in advance!
You should iterate over all possible orderings of the 'position' attribute, and for each one use Dijkstra's algorithm or BFS to find the shortest path that respects that ordering.
Since you know the position of the first and last nodes, there are only 3! = 6 different orderings for the intermediate nodes, so you only have to run Dijkstra's algorithm 6 times.
Even in python, this shouldn't take more than a couple hundred milliseconds to run, based on the input sizes you provided.
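A rough sketch of this approach, in terms of the question's graph, Path and (value, position) node tuples, might look like the following. Because fixing an ordering also fixes which position each of the five steps must have, a layer-by-layer sweep is enough (Dijkstra degenerates to this on a 5-layer DAG). The function name and the all_positions parameter are mine, not from the question, and the required middle nodes are only checked at the end, which can miss a valid path whose cheaper sibling lacks them; a fuller version would pin required nodes to specific layers.
import itertools
import math

def best_constrained_path(start, end, all_positions, middle=()):
    # positions not used by the start and end nodes; 3! = 6 orderings of these
    middle_positions = [p for p in all_positions if p not in (start[1], end[1])]
    best = Path([], math.inf)
    for order in itertools.permutations(middle_positions):
        layer_positions = list(order) + [end[1]]
        # frontier maps node -> (cheapest cost so far, path so far)
        frontier = {start: (0.0, [start])}
        for pos in layer_positions:
            new_frontier = {}
            for node, (cost, path) in frontier.items():
                for nb in graph.get(*node).connected_nodes:
                    if nb[1] != pos:
                        continue
                    c = cost + graph.weight(node, nb)
                    if nb not in new_frontier or c < new_frontier[nb][0]:
                        new_frontier[nb] = (c, path + [nb])
            frontier = new_frontier
        if end in frontier:
            cost, path = frontier[end]
            # note: only the cheapest path per node survives each layer, so this
            # check on the required middle nodes is best-effort (see above)
            if cost < best.cost and all(m in path for m in middle):
                best = Path(path, cost)
    return best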
I'm looking into the source code for the function sample in random.py (python standard library).
The idea is simple:
If a small sample (k) is needed from a large population (n): Just pick k random indices, since it is unlikely you'll pick the same number twice as the population is so large. And if you do, just pick again.
If a relatively large sample (k) is needed, compared to the total population (n): It is better to keep track of what you have picked.
My Question
There are a few constants involved, setsize = 21 and setsize += 4 ** _ceil(_log(k * 3, 4)). The critical ratio is roughly k : 21+3k. The comments say # size of a small set minus size of an empty list and # table size for big sets.
Where have these specific numbers come from? What is their justification?
The comments shed some light, however I find they bring as many questions as they answer.
I would kind of understand "size of a small set", but I find the "minus size of an empty list" part confusing. Can someone shed any light on this?
What is meant specifically by "table" size, as opposed to, say, "set size"?
Looking at the GitHub repository, it looks like a very old version simply used k : 6*k as the critical ratio, but I find that equally mysterious.
The code
def sample(self, population, k):
    """Chooses k unique random elements from a population sequence or set.

    Returns a new list containing elements from the population while
    leaving the original population unchanged. The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples. This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique. If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    To choose a sample in a range of integers, use range as an argument.
    This is especially fast and space efficient for sampling from a
    large population: sample(range(10000000), 60)
    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection. For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    if isinstance(population, _Set):
        population = tuple(population)
    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence or set. For dicts, use list(d).")
    randbelow = self._randbelow
    n = len(population)
    if not 0 <= k <= n:
        raise ValueError("Sample larger than population or is negative")
    result = [None] * k
    setsize = 21        # size of a small set minus size of an empty list
    if k > 5:
        setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
    if n <= setsize:
        # An n-length list is smaller than a k-length set
        pool = list(population)
        for i in range(k):          # invariant: non-selected at [0,n-i)
            j = randbelow(n-i)
            result[i] = pool[j]
            pool[j] = pool[n-i-1]   # move non-selected item into vacancy
    else:
        selected = set()
        selected_add = selected.add
        for i in range(k):
            j = randbelow(n)
            while j in selected:
                j = randbelow(n)
            selected_add(j)
            result[i] = population[j]
    return result
(I apologise if this question would be better placed on math.stackexchange. I couldn't think of any probability/statistics-y reasons for this particular ratio, and the comments sounded as though it was maybe something to do with the amount of space that sets and lists use - but I couldn't find any details anywhere.)
This code is attempting to determine whether using a list or a set would take more space (instead of trying to estimate the time cost, for some reason).
It looks like 21 was the difference between the size of an empty list and a small set on the Python build this constant was determined on, expressed in multiples of the size of a pointer. I don't have a build of that version of Python, but testing on my 64-bit CPython 3.6.3 gives a difference of 20 pointer sizes:
>>> sys.getsizeof(set()) - sys.getsizeof([])
160
and comparing the 3.6.3 list and set struct definitions to the list and set definitions from the change that introduced this code, 21 seems plausible.
I said "the difference between the size of an empty list and a small set" because both now and at the time, small sets used a hash table contained inside the set struct itself instead of externally allocated:
setentry smalltable[PySet_MINSIZE];
The
if k > 5:
    setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
check adds the size of the external table allocated for sets larger than 5 elements, with size again expressed in number of pointers. This computation assumes the set never shrinks, since the sampling algorithm never removes elements. I am not currently sure whether this computation is exact.
Finally,
if n <= setsize:
compares the base overhead of a set plus any space used by an external hash table to the n pointers required by a list of the input elements. (It doesn't seem to account for the overallocation performed by list(population), so it may be underestimating the cost of the list.)
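To see concretely where that crossover falls, here is a small snippet (mine, not part of random.py) that evaluates the same setsize expression for a few sample sizes; the numbers are in units of pointers, per the reasoning above.
from math import ceil, log

for k in (5, 10, 50, 500, 5000):
    setsize = 21                              # base set overhead vs. an empty list
    if k > 5:
        setsize += 4 ** ceil(log(k * 3, 4))   # external hash table size
    print(k, setsize)   # the pool-copying branch is taken when n <= setsize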
I have a very large graph represented in a text file of size about 1TB with each edge as follows.
From-node to-node
I would like to split it into its weakly connected components. If it was smaller I could load it into networkx and run their component finding algorithms. For example
http://networkx.github.io/documentation/latest/reference/generated/networkx.algorithms.components.connected.connected_components.html#networkx.algorithms.components.connected.connected_components
Is there any way to do this without loading the whole thing into memory?
If you have few enough nodes (e.g. a few hundred million), then you could compute the connected components with a single pass through the text file by using a disjoint set forest stored in memory.
This data structure only stores the rank and parent pointer for each node so should fit in memory if you have few enough nodes.
For larger number of nodes, you could try the same idea, but storing the data structure on disk (and possibly improved by using a cache in memory to store frequently used items).
Here is some Python code that implements a simple in-memory version of disjoint set forests:
N = 7  # Number of nodes

rank = [0] * N
parent = range(N)

def Find(x):
    """Find representative of connected component"""
    if parent[x] != x:
        parent[x] = Find(parent[x])
    return parent[x]

def Union(x, y):
    """Merge sets containing elements x and y"""
    x = Find(x)
    y = Find(y)
    if x == y:
        return
    if rank[x] < rank[y]:
        parent[x] = y
    elif rank[x] > rank[y]:
        parent[y] = x
    else:
        parent[y] = x
        rank[x] += 1

with open("disjointset.txt", "r") as fd:
    for line in fd:
        fr, to = map(int, line.split())
        Union(fr, to)

for n in range(N):
    print n, 'is in component', Find(n)
If you apply it to the text file called disjointset.txt containing:
1 2
3 4
4 5
0 5
it prints
0 is in component 3
1 is in component 1
2 is in component 1
3 is in component 3
4 is in component 3
5 is in component 3
6 is in component 6
You could save memory by not using the rank array, at the cost of potentially increased computation time.
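For instance, a minimal variant of Union without the rank array might look like this (the representative is then chosen arbitrarily, and the path compression already done in Find is what keeps the trees shallow in practice):
def Union(x, y):
    """Merge sets containing elements x and y (no union by rank)"""
    x = Find(x)
    y = Find(y)
    if x != y:
        parent[y] = x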
If even the number of nodes is too large to fit in memory, you can divide and conquer and use external memory sorts to do most of the work for you (e.g. the sort command included with Windows and Unix can sort files much larger than memory):
1. Choose some threshold vertex k.
2. Read the original file and write each of its edges to one of 3 files (a minimal sketch of this partitioning step appears below):
   - To a if its maximum-numbered vertex is < k
   - To b if its minimum-numbered vertex is >= k
   - To c otherwise (i.e. if it has one vertex < k and one vertex >= k)
3. If a is small enough to solve (find connected components for) in memory (using e.g. Peter de Rivaz's algorithm) then do so, otherwise recurse to solve it. The solution should be a file whose lines each consist of two numbers x y and which is sorted by x. Each x is a vertex number and y is its representative -- the lowest-numbered vertex in the same component as x.
4. Do likewise for b.
5. Sort edges in c by their smallest-numbered endpoint.
6. Go through each edge in c, renaming the endpoint that is < k (remember, there must be exactly one such endpoint) to its representative, found from the solution to the subproblem a. This can be done efficiently by using a linear-time merge algorithm to merge with the solution to the subproblem a. Call the resulting file d.
7. Sort edges in d by their largest-numbered endpoint. (The fact that we have already renamed the smallest-numbered endpoint doesn't make this unsafe, since renaming can never increase a vertex's number.)
8. Go through each edge in d, renaming the endpoint that is >= k to its representative, found from the solution to the subproblem b using a linear-time merge as before. Call the resulting file e.
9. Solve e. (As with a and b, do this directly in memory if possible, otherwise recurse. If you need to recurse, you will need to find a different way of partitioning the edges, since all the edges in e already "straddle" k. You could for example renumber vertices using a random permutation of vertex numbers, recurse to solve the resulting problem, then rename them back.) This step is necessary because there could be an edge (1, k), another edge (2, k+1) and a third edge (2, k), and this will mean that all vertices in the components 1, 2, k and k+1 need to be combined into a single component.
10. Go through each line in the solution for subproblem a, updating the representative for this vertex using the solution to subproblem e if necessary. This can be done efficiently using a linear-time merge. Write out the new list of representatives (which will already be sorted by vertex number due to the fact that we created it from a's solution) to a file f.
11. Do likewise for each line in the solution for subproblem b, creating file g.
12. Concatenate f and g to produce the final answer. (For better efficiency, just have step 11 append its results directly to f).
All the linear-time merge operations used above can read directly from disk files, since they only ever access items from each list in increasing order (i.e. no slow random access is needed).
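As a concrete illustration of step 2, a minimal partitioning sketch might look like the following (the file names a/b/c and the whitespace-separated "from to" edge format are taken from the steps and the question; everything else is assumed):
def partition_edges(edge_file, k):
    with open(edge_file) as edges, \
         open("a", "w") as fa, open("b", "w") as fb, open("c", "w") as fc:
        for line in edges:
            u, v = map(int, line.split())
            if max(u, v) < k:
                fa.write(line)        # both endpoints below the threshold
            elif min(u, v) >= k:
                fb.write(line)        # both endpoints at or above the threshold
            else:
                fc.write(line)        # the edge straddles the threshold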
External-memory graph traversal is tricky to make performant. I advise against writing your own code; the implementation details make the difference between a runtime of a few hours and a runtime of a few months. You should consider using an existing library like STXXL. See here for a paper using it to compute connected components.
I have looked at some common tools like Heapy to measure how much memory is being utilized by each traversal technique but I don't know if they are giving me the right results. Here is some code to give the context.
The code simply counts the number of unique nodes in a graph. Two traversal techniques are provided, viz. count_bfs and count_dfs.
import sys
from guppy import hpy
class Graph:
    def __init__(self, key):
        self.key = key  # unique id for a vertex
        self.connections = []
        self.visited = False

def count_bfs(start):
    parents = [start]
    children = []
    count = 0
    while parents:
        for ind in parents:
            if not ind.visited:
                count += 1
                ind.visited = True
                for child in ind.connections:
                    children.append(child)
        parents = children
        children = []
    return count

def count_dfs(start):
    if not start.visited:
        start.visited = True
    else:
        return 0
    n = 1
    for connection in start.connections:
        n += count_dfs(connection)
    return n

def construct(file, s=1):
    """Constructs a Graph using the adjacency matrix given in the file

    :param file: path to the file with the matrix
    :param s: starting node key. Defaults to 1
    :return start vertex of the graph
    """
    d = {}
    f = open(file, 'rU')
    size = int(f.readline())
    for x in xrange(1, size+1):
        d[x] = Graph(x)
    start = d[s]
    for i in xrange(0, size):
        l = map(lambda x: int(x), f.readline().split())
        node = l[0]
        for child in l[1:]:
            d[node].connections.append(d[child])
    return start

if __name__ == "__main__":
    s = construct(sys.argv[1])
    #h = hpy()
    print(count_bfs(s))
    #print h.heap()
    s = construct(sys.argv[1])
    #h = hpy()
    print(count_dfs(s))
    #print h.heap()
I want to know by what factor the total memory utilization differs between the two traversal techniques, viz. count_dfs and count_bfs. One might have the intuition that DFS may be expensive, as a new stack frame is created for every function call. How can the total memory allocations in each traversal technique be measured?
Do the (commented) hpy statements give the desired measure?
Sample file with connections:
4
1 2 3
2 1 3
3 4
4
This being a Python question, it may be more important how much stack space is used than how much total memory. CPython has a low limit of 1000 frames because it shares its call stack with the C call stack, which in turn is limited to the order of one megabyte in most places. For this reason you should almost* always prefer iterative solutions to recursive ones when the recursion depth is unbounded.
* other implementations of Python may not have this restriction. Stackless variants of CPython and PyPy, for example, do not share the C call stack in this way.
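For example, a minimal iterative rewrite of the question's count_dfs with an explicit stack avoids the recursion-depth limit entirely (a sketch, assuming the question's Graph class with its connections and visited attributes):
def count_dfs_iterative(start):
    stack = [start]
    count = 0
    while stack:
        node = stack.pop()
        if node.visited:
            continue
        node.visited = True
        count += 1
        # push the children; they will be visited in reverse order, which
        # does not matter for counting
        stack.extend(node.connections)
    return count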
For your specific problem, I don't know if there's going to be an easy solution. That's because the peak memory usage of a graph traversal depends on the details of the graph itself.
For a depth-first traversal, the greatest usage will come when the algorithm has gone to the deepest depth. In your example graph, it will traverse 1->2->3->4, and create a stack frame for each level. So while it is at 4 it has allocated the most memory.
For the breadth-first traversal, the memory used will be proportional to the number of nodes at each depth plus the number of child nodes at the next depth. Those values are stored in lists, which are probably more efficient than stack frames. In the example, since the first node is connected to all the others, it happens immediately during the first step [1]->[2,3,4].
I'm sure there are some graphs that will do much better with one search or the other.
For example, imagine a graph that looked like a linked list, with all the vertices in a single long chain. The depth-first traversal will have a very high peak memory usage, since it will recurse all the way down the chain, allocating a stack frame for each level. The breadth-first traversal will use much less memory, since it will only have a single vertex with a single child to keep track of on each step.
Now, contrast that with a graph that is a depth-2 tree. That is, there's a single root element that is connected to a great many children, none of which are connected to each other. The depth-first traversal will not use much memory at any given time, as it will only need to traverse two nodes before it has to back up and try another branch. The breadth-first traversal, on the other hand, will be putting all of the child nodes in memory at once, which for a big tree could be problematic.
Your current profiling code won't find the peak memory usage you want, because it only finds the memory used by objects on the heap at the time you call heap(). That's likely to be the same before and after your traversals. Instead, you'll need to insert profiling code into the traversal functions themselves. I can't find a pre-built package of guppy to try it myself, but I think this untested code will work:
from guppy import hpy

def count_bfs(start):
    hp = hpy()
    base_mem = hp.heap().size
    max_mem = 0
    parents = [start]
    children = []
    count = 0
    while parents:
        for ind in parents:
            if not ind.visited:
                count += 1
                ind.visited = True
                for child in ind.connections:
                    children.append(child)
        mem = hp.heap().size - base_mem
        if mem > max_mem:
            max_mem = mem
        parents = children
        children = []
    return count, max_mem

def count_dfs(start, hp=hpy(), base_mem=None):
    if base_mem is None:
        base_mem = hp.heap().size
    if not start.visited:
        start.visited = True
    else:
        return 0, hp.heap().size - base_mem
    n = 1
    max_mem = 0
    for connection in start.connections:
        c, mem = count_dfs(connection, hp, base_mem)
        if mem > max_mem:
            max_mem = mem
        n += c
    return n, max_mem
Both traversal functions now return a (count, max-memory-used) tuple. You can try them out on a variety of graphs to see what the differences are.
It's tough to measure exactly how much memory is being used because systems vary in how they implement stack frames. Generally speaking, recursive algorithms use far more memory than iterative algorithms because each stack frame must store the state of its variables whenever a new function call occurs. Consider the difference between dynamic programming solutions and recursive solutions. Runtime is far faster on an iterative implementation of an algorithm than a recursive one.
If you really must know how much memory your code uses, load your software in a debugger such as OllyDbg (http://www.ollydbg.de/) and count the bytes. Happy coding!
Of the two, depth-first uses less memory if most traversals end up hitting most of the graph.
Breadth-first can be better when the target is near the starting node, or when the number of nodes doesn't go up very quickly so the parents/children arrays in your code stay small (e.g. another answer mentioned linked list as worst-case for DFS).
If the graph you're searching is spatial data, or has what's known as an "admissible heuristic," A* is another algorithm that's pretty good: http://en.wikipedia.org/wiki/A_star
However, premature optimization is the root of all evil. Look at the actual data you want to use; if it fits in a reasonable amount of memory, and the search runs in a reasonable time, it doesn't matter which algorithm you use. NB, what's "reasonable" depends on the application you're using it for and the amount of resources on the hardware that will be running it.
For either search order implemented iteratively with the standard data structure describing it (queue for BFS, stack for DFS), I can construct a graph that uses O(n) memory trivially. For BFS, it's an n-star, and for DFS it's an n-chain. I don't believe either of them can be implemented for the general case to do better than that, so that also gives an Omega(n) lower bound on maximum memory usage. So, with efficient implementations of each, it should generally be a wash.
Now, if your input graphs have some characteristics that bias them more toward one of those extremes or the other, that might inform your decision on which to use in practice.
Recently I needed to do weighted random selection of elements from a list, both with and without replacement. While there are well known and good algorithms for unweighted selection, and some for weighted selection without replacement (such as modifications of the reservoir algorithm), I couldn't find any good algorithms for weighted selection with replacement. I also wanted to avoid the reservoir method, as I was selecting a significant fraction of the list, which is small enough to hold in memory.
Does anyone have any suggestions on the best approach in this situation? I have my own solutions, but I'm hoping to find something more efficient, simpler, or both.
One of the fastest ways to make many with replacement samples from an unchanging list is the alias method. The core intuition is that we can create a set of equal-sized bins for the weighted list that can be indexed very efficiently through bit operations, to avoid a binary search. It will turn out that, done correctly, we will need to only store two items from the original list per bin, and thus can represent the split with a single percentage.
Let us take the example of five equally weighted choices, (a:1, b:1, c:1, d:1, e:1)
To create the alias lookup:
Normalize the weights such that they sum to 1.0. (a:0.2 b:0.2 c:0.2 d:0.2 e:0.2) This is the probability of choosing each weight.
Find the smallest power of 2 greater than or equal to the number of variables, and create this number of partitions, |p|. Each partition represents a probability mass of 1/|p|. In this case, we create 8 partitions, each able to contain 0.125.
Take the variable with the least remaining weight, and place as much of its mass as possible in an empty partition. In this example, we see that a fills the first partition. (p1{a|null,1.0},p2,p3,p4,p5,p6,p7,p8) with (a:0.075, b:0.2 c:0.2 d:0.2 e:0.2)
If the partition is not filled, take the variable with the most weight, and fill the partition with that variable.
Repeat steps 3 and 4 until all of the weight from the original list has been assigned to partitions.
For example, if we run another iteration of 3 and 4, we see
(p1{a|null,1.0},p2{a|b,0.6},p3,p4,p5,p6,p7,p8) with (a:0, b:0.15 c:0.2 d:0.2 e:0.2) left to be assigned
At runtime:
Get a U(0,1) random number, say binary 0.001100000
bit-shift it by lg2(|p|) places, finding the index of the partition. Thus, we shift it by 3, yielding 001.1, or position 1, and thus partition 2.
If the partition is split, use the decimal portion of the shifted random number to decide the split. In this case, the value is 0.5, and 0.5 < 0.6, so return a.
Here is some code and another explanation, but unfortunately it doesn't use the bitshifting technique, nor have I actually verified it.
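As a rough sketch of the runtime steps above (not the linked code, and not verified against it): assume the table has already been built by the construction procedure as a power-of-two-length list of (primary, alias, cutoff) entries, where cutoff is the fraction of the bin owned by primary; those names are mine. The top bits of a single random draw pick the bin, and the remaining bits decide between the bin's two entries. The hand-built table corresponds to one possible completion of the five-equal-weights example.
import random

def alias_lookup(table):
    nbits = len(table).bit_length() - 1          # lg2(|p|), |p| a power of two
    r = random.getrandbits(32)                   # one uniform 32-bit draw
    idx = r >> (32 - nbits)                      # top bits choose the bin
    frac = (r & ((1 << (32 - nbits)) - 1)) / float(1 << (32 - nbits))
    primary, alias, cutoff = table[idx]
    return primary if frac < cutoff else alias

# One possible table for the (a..e, all weight 0.2) example: 8 bins of 0.125 each
table = [('a', 'a', 1.0), ('a', 'b', 0.6), ('b', 'b', 1.0), ('b', 'c', 0.2),
         ('c', 'd', 0.8), ('d', 'd', 1.0), ('d', 'e', 0.4), ('e', 'e', 1.0)]
counts = {}
for _ in range(100000):
    item = alias_lookup(table)
    counts[item] = counts.get(item, 0) + 1
print(counts)   # each of a..e should come up roughly 20000 times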
A simple approach that hasn't been mentioned here is one proposed in Efraimidis and Spirakis. In python you could select m items from n >= m weighted items with strictly positive weights stored in weights, returning the selected indices, with:
import heapq
import math
import random
def WeightedSelectionWithoutReplacement(weights, m):
    elt = [(math.log(random.random()) / weights[i], i) for i in range(len(weights))]
    return [x[1] for x in heapq.nlargest(m, elt)]
This is very similar in structure to the first approach proposed by Nick Johnson. Unfortunately, that approach is biased in selecting the elements (see the comments on the method). Efraimidis and Spirakis proved that their approach is equivalent to random sampling without replacement in the linked paper.
Here's what I came up with for weighted selection without replacement:
import random

def WeightedSelectionWithoutReplacement(l, n):
    """Selects without replacement n random elements from a list of (weight, item) tuples."""
    l = sorted((random.random() * x[0], x[1]) for x in l)
    return l[-n:]
This is O(m log m) on the number of items in the list to be selected from. I'm fairly certain this will weight items correctly, though I haven't verified it in any formal sense.
Here's what I came up with for weighted selection with replacement:
import bisect
import random

def WeightedSelectionWithReplacement(l, n):
    """Selects with replacement n random elements from a list of (weight, item) tuples."""
    cum_weights = []
    items = []
    total_weight = 0.0
    for weight, item in l:
        total_weight += weight
        cum_weights.append(total_weight)
        items.append(item)
    # bisect the cumulative weights rather than (weight, item) tuples so the
    # comparison against a float is valid, and return the chosen items
    return [items[bisect.bisect(cum_weights, random.random() * total_weight)] for x in range(n)]
This is O(m + n log m), where m is the number of items in the input list, and n is the number of items to be selected.
I'd recommend you start by looking at section 3.4.2 of Donald Knuth's Seminumerical Algorithms.
If your arrays are large, there are more efficient algorithms in chapter 3 of Principles of Random Variate Generation by John Dagpunar. If your arrays are not terribly large or you're not concerned with squeezing out as much efficiency as possible, the simpler algorithms in Knuth are probably fine.
It is possible to do Weighted Random Selection with replacement in O(1) time, after first creating an additional O(N)-sized data structure in O(N) time. The algorithm is based on the Alias Method developed by Walker and Vose, which is well described here.
The essential idea is that each bin in a histogram would be chosen with probability 1/N by a uniform RNG. So we will walk through it, and for any underpopulated bin which would receive excess hits, assign the excess to an overpopulated bin. For each bin, we store the percentage of hits which belong to it, and the partner bin for the excess. This version tracks small and large bins in place, removing the need for an additional stack. It uses the index of the partner (stored in bucket[1]) as an indicator that they have already been processed.
Here is a minimal python implementation, based on the C implementation here
import random

def prep(weights):
    data_sz = len(weights)
    factor = data_sz / float(sum(weights))
    data = [[w * factor, i] for i, w in enumerate(weights)]
    big = 0
    while big < data_sz and data[big][0] <= 1.0: big += 1
    for small, bucket in enumerate(data):
        if bucket[1] != small: continue   # already used as somebody's partner
        excess = 1.0 - bucket[0]
        while excess > 0:
            if big == data_sz: break
            bucket[1] = big               # record the partner (alias) bin
            bucket = data[big]
            bucket[0] -= excess           # the big bin absorbs the excess
            excess = 1.0 - bucket[0]
            if excess >= 0:               # big bin no longer overfull, advance
                big += 1
                while big < data_sz and data[big][0] <= 1: big += 1
    return data

def sample(data):
    r = random.random() * len(data)
    idx = int(r)
    return data[idx][1] if r - idx > data[idx][0] else idx
Example usage:
TRIALS = 1000

weights = [20, 1.5, 9.8, 10, 15, 10, 15.5, 10, 8, .2]
samples = [0] * len(weights)

data = prep(weights)
for _ in range(int(sum(weights) * TRIALS)):
    samples[sample(data)] += 1

result = [float(s) / TRIALS for s in samples]
err = [a - b for a, b in zip(result, weights)]
print(result)
print([round(e, 5) for e in err])
print(sum([e * e for e in err]))
The following is a description of random weighted selection of an element of a
set (or multiset, if repeats are allowed), both with and without replacement in O(n) space
and O(log n) time.
It consists of implementing a binary search tree, sorted by the elements to be
selected, where each node of the tree contains:
the element itself (element)
the un-normalized weight of the element (elementweight), and
the sum of all the un-normalized weights of the left-child node and all of
its children (leftbranchweight).
the sum of all the un-normalized weights of the right-child node and all of
its children (rightbranchweight).
Then we randomly select an element from the BST by descending down the tree. A
rough description of the algorithm follows. The algorithm is given a node of
the tree. Then the values of leftbranchweight, rightbranchweight,
and elementweight of the node are summed, and the weights are divided by this
sum, resulting in the values leftbranchprobability,
rightbranchprobability, and elementprobability, respectively. Then a
random number between 0 and 1 (randomnumber) is obtained.
if the number is less than elementprobability,
remove the element from the BST as normal, updating leftbranchweight
and rightbranchweight of all the necessary nodes, and return the
element.
else if the number is less than (elementprobability + leftbranchprobability)
recurse on leftchild (run the algorithm using leftchild as node)
else
recurse on rightchild
When we finally find, using these weights, which element is to be returned, we either simply return it (with replacement) or we remove it and update relevant weights in the tree (without replacement).
DISCLAIMER: The algorithm is rough, and a treatise on the proper implementation
of a BST is not attempted here; rather, it is hoped that this answer will help
those who really need fast weighted selection without replacement (like I do).
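The same O(log n)-per-draw behaviour can also be had without writing a full BST by keeping the partial weight sums in a Fenwick (binary indexed) tree. This is not the structure described above, just a compact sketch of the same idea (all names are mine), supporting draws both with and without replacement:
import random

class FenwickSampler:
    """Weighted sampling in O(log n) per draw using partial weight sums."""
    def __init__(self, weights):
        self.n = len(weights)
        self.total = float(sum(weights))
        self.weights = list(weights)
        self.tree = [0.0] * (self.n + 1)
        for i, w in enumerate(weights, start=1):
            self._add(i, w)

    def _add(self, i, delta):
        # add delta to element i (1-based) and to every range covering it
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def _find(self, target):
        # smallest 1-based index whose prefix sum exceeds target
        idx = 0
        bit = 1 << self.n.bit_length()
        while bit:
            nxt = idx + bit
            if nxt <= self.n and self.tree[nxt] <= target:
                idx = nxt
                target -= self.tree[nxt]
            bit >>= 1
        return idx + 1

    def draw(self, replace=True):
        i = self._find(random.random() * self.total) - 1   # back to 0-based
        if not replace:
            # zero out the weight so the element cannot be drawn again
            self._add(i + 1, -self.weights[i])
            self.total -= self.weights[i]
            self.weights[i] = 0.0
        return i

# e.g. three indices drawn without replacement from five weighted items
s = FenwickSampler([1, 1, 8, 5, 5])
print([s.draw(replace=False) for _ in range(3)])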
This is an old question for which numpy now offers an easy solution, so I thought I would mention it. numpy.random.choice (available since NumPy 1.7) allows the sampling to be done with or without replacement and with given weights.
Suppose you want to sample 3 elements without replacement from the list ['white','blue','black','yellow','green'] with a prob. distribution [0.1, 0.2, 0.4, 0.1, 0.2]. Using numpy.random module it is as easy as this:
import numpy.random as rnd
sampling_size = 3
domain = ['white','blue','black','yellow','green']
probs = [.1, .2, .4, .1, .2]
sample = rnd.choice(domain, size=sampling_size, replace=False, p=probs)
# in short: rnd.choice(domain, sampling_size, False, probs)
print(sample)
# Possible output: ['white' 'black' 'blue']
Setting the replace flag to True, you have a sampling with replacement.
More info here:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html#numpy.random.choice
We faced the problem of randomly selecting K validators out of N candidates once per epoch, proportionally to their stakes. But this gives us the following problem:
Imagine the probabilities of each candidate:
0.1
0.1
0.8
The empirical probabilities of each candidate after 1,000,000 selections of 2 out of 3 without replacement became:
0.254315
0.256755
0.488930
You should know that those original probabilities are not achievable for a 2-of-3 selection without replacement.
But we want the initial probabilities to also be the reward distribution probabilities; otherwise it makes small candidate pools more profitable. So we realized that random selection with replacement would help us: randomly select >K of N and also store the weight of each validator for reward distribution:
std::vector<int> validators;
std::vector<int> weights(n);
int totalWeights = 0;
for (int j = 0; validators.size() < m; j++) {
    int value = rand() % likehoodsSum;
    for (int i = 0; i < n; i++) {
        if (value < likehoods[i]) {
            if (weights[i] == 0) {
                validators.push_back(i);
            }
            weights[i]++;
            totalWeights++;
            break;
        }
        value -= likehoods[i];
    }
}
It gives almost the original distribution of rewards over millions of samples:
0.101230
0.099113
0.799657