Hash value for directed acyclic graph - python

How do I transform a directed acyclic graph into a hash value such that any two isomorphic graphs hash to the same value? It is acceptable, but undesirable for two isomorphic graphs to hash to different values, which is what I have done in the code below. We can assume that the number of vertices in the graph is at most 11.
I am particularly interested in Python code.
Here is what I did. If self.lt is a mapping from node to descendants (not children!), then I relabel the nodes according to a modified topological sort (that prefers to order elements with more descendants first if it can). Then, I hash the sorted dictionary. Some isomorphic graphs will hash to different values, especially as the number of nodes grows.
I have included all the code to motivate my use case. I am calculating the number of comparisons required to find the median of 7 numbers. The more often isomorphic graphs hash to the same value, the less work has to be redone. I considered putting larger connected components first, but didn't see how to do that quickly.
from tools.decorator import memoized  # A standard memoization decorator

class Graph:
    def __init__(self, n):
        self.lt = {i: set() for i in range(n)}

    def compared(self, i, j):
        return j in self.lt[i] or i in self.lt[j]

    def withedge(self, i, j):
        retval = Graph(len(self.lt))
        implied_lt = self.lt[j] | set([j])
        for (s, lt_s), (k, lt_k) in zip(self.lt.items(),
                                        retval.lt.items()):
            lt_k |= lt_s
            if i in lt_k or k == i:
                lt_k |= implied_lt
        return retval.toposort()

    def toposort(self):
        mapping = {}
        while len(mapping) < len(self.lt):
            for i, lt_i in self.lt.items():
                if i in mapping:
                    continue
                if any(i in lt_j or len(lt_i) < len(lt_j)
                       for j, lt_j in self.lt.items()
                       if j not in mapping):
                    continue
                mapping[i] = len(mapping)
        retval = Graph(0)
        for i, lt_i in self.lt.items():
            retval.lt[mapping[i]] = {mapping[j]
                                     for j in lt_i}
        return retval

    def median_known(self):
        n = len(self.lt)
        for i, lt_i in self.lt.items():
            if len(lt_i) != n // 2:
                continue
            if sum(1
                   for j, lt_j in self.lt.items()
                   if i in lt_j) == n // 2:
                return True
        return False

    def __repr__(self):
        return("[{}]".format(", ".join("{}: {{{}}}".format(
            i,
            ", ".join(str(x) for x in lt_i))
            for i, lt_i in self.lt.items())))

    def hashkey(self):
        return tuple(sorted({k: tuple(sorted(v))
                             for k, v in self.lt.items()}.items()))

    def __hash__(self):
        return hash(self.hashkey())

    def __eq__(self, other):
        return self.hashkey() == other.hashkey()

@memoized
def mincomps(g):
    print("Calculating:", g)
    if g.median_known():
        return 0
    nodes = g.lt.keys()
    return 1 + min(max(mincomps(g.withedge(i, j)),
                       mincomps(g.withedge(j, i)))
                   for i in nodes
                   for j in nodes
                   if j > i and not g.compared(i, j))

g = Graph(7)
print(mincomps(g))

To effectively test for graph isomorphism you will want to use nauty. Specifically for Python there is the wrapper pynauty, but I can't attest to its quality (to compile it correctly I had to do some simple patching on its setup.py). If this wrapper is doing everything correctly, then it simplifies nauty a lot for the uses you are interested in, and it is only a matter of hashing pynauty.certificate(somegraph) -- which will be the same value for isomorphic graphs.
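A rough sketch of that suggestion, assuming pynauty's documented interface (Graph(number_of_vertices, directed=..., adjacency_dict=...) and certificate()); adapt it to the wrapper version you actually have installed:
import pynauty

def graph_hash(n, successors):
    """successors: dict mapping each vertex in range(n) to an iterable of successor vertices."""
    g = pynauty.Graph(number_of_vertices=n, directed=True,
                      adjacency_dict={v: list(s) for v, s in successors.items()})
    return hash(pynauty.certificate(g))  # identical for isomorphic graphs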
Some quick tests showed that pynauty is giving the same certificate for every graph (with the same number of vertices). But that is only because of a minor issue in the wrapper when converting the graph to nauty's format. After fixing this, it works for me (I also used the graphs at http://funkybee.narod.ru/graphs.htm for comparison). Here is the short patch, which also includes the modifications needed in setup.py:
diff -ur pynauty-0.5-orig/setup.py pynauty-0.5/setup.py
--- pynauty-0.5-orig/setup.py 2011-06-18 20:53:17.000000000 -0300
+++ pynauty-0.5/setup.py 2013-01-28 22:09:07.000000000 -0200
@@ -31,7 +31,9 @@
 ext_pynauty = Extension(
         name = MODULE + '._pynauty',
-        sources = [ pynauty_dir + '/' + 'pynauty.c', ],
+        sources = [ pynauty_dir + '/' + 'pynauty.c',
+            os.path.join(nauty_dir, 'schreier.c'),
+            os.path.join(nauty_dir, 'naurng.c')],
         depends = [ pynauty_dir + '/' + 'pynauty.h', ],
         extra_compile_args = [ '-O4' ],
         extra_objects = [ nauty_dir + '/' + 'nauty.o',
diff -ur pynauty-0.5-orig/src/pynauty.c pynauty-0.5/src/pynauty.c
--- pynauty-0.5-orig/src/pynauty.c 2011-03-03 23:34:15.000000000 -0300
+++ pynauty-0.5/src/pynauty.c 2013-01-29 00:38:36.000000000 -0200
@@ -320,7 +320,7 @@
     PyObject *adjlist;
     PyObject *p;
-    int i,j;
+    Py_ssize_t i, j;
     int adjlist_length;
     int x, y;

Graph isomorphism for directed acyclic graphs is still GI-complete. Therefore there is currently no known (worst-case sub-exponential) solution to guarantee that two isomorphic directed acyclic graphs will yield the same hash. Only if the mapping between different graphs is known - for example if all vertices have unique labels - could one efficiently guarantee matching hashes.
Okay, let's brute force this for a small number of vertices. We have to find a representation of the graph that is independent of the ordering of the vertices in the input and therefore guarantees that isomorphic graphs yield the same representation. Further this representation must ensure that no two non-isomorphic graphs yield the same representation.
The simplest solution is to construct the adjacency matrix for all n! permutations of the vertices and just interpret each adjacency matrix as an n^2-bit integer. Then we can pick the smallest or largest of these numbers as the canonical representation. This number completely encodes the graph and therefore ensures that no two non-isomorphic graphs yield the same number - one could consider this function a perfect hash function. And because we choose the smallest or largest number encoding the graph under all possible permutations of the vertices, we further ensure that isomorphic graphs yield the same representation.
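For small n this is easy to write down directly; here is a minimal brute-force sketch (my own illustration) that encodes the graph as a set of (u, v) edges and returns the smallest adjacency-matrix integer over all vertex permutations:
from itertools import permutations

def canonical_code(n, edges):
    """edges: set of (u, v) pairs with 0 <= u, v < n."""
    best = None
    for perm in permutations(range(n)):
        # adjacency matrix of the relabelled graph, read row by row as one integer
        code = 0
        for u in range(n):
            for v in range(n):
                code <<= 1
                if (perm[u], perm[v]) in edges:
                    code |= 1
        if best is None or code < best:
            best = code
    return best  # equal for isomorphic graphs, distinct otherwise

# Two relabelings of the same 3-vertex DAG get the same code:
print(canonical_code(3, {(0, 1), (1, 2)}) == canonical_code(3, {(2, 0), (0, 1)}))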
How good or bad is this in the case of 11 vertices? Well, the representation will have 121 bits. We can reduce this by 11 bits because the diagonal, representing loops, will be all zeros in an acyclic graph, and we are left with 110 bits. This number could in theory be decreased further; not all 2^110 remaining graphs are acyclic, and for each graph there may be up to 11! - roughly 2^25 - isomorphic representations, but in practice this might be quite hard to exploit. Does anybody know how to compute the number of distinct directed acyclic graphs with n vertices?
How long will it take to find this representation? Naively 11! or 39,916,800 iterations. This is not nothing and probably already impractical, but I did not implement and test it. We can probably speed this up a bit, though. If we interpret the adjacency matrix as an integer by concatenating the rows from top to bottom, left to right, we want many ones (zeros) at the left of the first row to obtain a large (small) number. Therefore we pick as first vertex the one (or one of the vertices) with the largest (smallest) degree (indegree or outdegree depending on the representation), and then vertices connected (not connected) to this vertex in subsequent positions to bring the ones (zeros) to the left.
There are likely more possibilities to prune the search space but I am not sure if there are enough to make this a practical solution. Maybe there are or maybe somebody else can at least build something upon this idea.

How good does the hash have to be? I assume that you do not want a full serialization of the graph. A hash rarely guarantees that there is no second (but different) element (graph) that evaluates to the same hash. If it is very important to you that isomorphic graphs (in different representations) have the same hash, then only use values that are invariant under a change of representation. E.g.:
the total number of nodes
the total number of (directed) connections
the total number of nodes with (indegree, outdegree) = (i,j) for any tuple (i,j) up to (max(indegree), max(outdegree)) (or limited for tuples up to some fixed value (m,n))
All of this information can be gathered in O(#nodes) [assuming the graph is stored properly]. Concatenate it and you have a hash. If you prefer, you can use some well-known hash algorithm like SHA on the concatenated information. Without additional hashing it is a continuous hash (it allows finding similar graphs); with additional hashing it is uniform and fixed in size if the chosen hash algorithm has those properties.
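For illustration, a small sketch (not from the answer itself) that gathers the first three invariants for a graph given as an adjacency dict of node -> set of successors, and hashes them with SHA-256:
import hashlib
from collections import Counter

def invariant_hash(adj):
    n = len(adj)                                       # total number of nodes
    m = sum(len(children) for children in adj.values())  # total number of directed edges
    indeg = Counter()
    for children in adj.values():
        for c in children:
            indeg[c] += 1
    # histogram of (indegree, outdegree) pairs -- invariant under relabelling
    histogram = sorted(Counter((indeg[v], len(adj[v])) for v in adj).items())
    return hashlib.sha256(repr((n, m, histogram)).encode()).hexdigest()

g1 = {0: {1, 2}, 1: {2}, 2: set()}
g2 = {'a': {'b', 'c'}, 'c': {'b'}, 'b': set()}   # same graph, different labels
print(invariant_hash(g1) == invariant_hash(g2))  # True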
As it is, it is already good enough to register any added or removed connection. It might miss connections that were changed though (a -> c instead of a -> b).
This approach is modular and can be extended as far as you like. Any additional property that is being included will reduce the number of collisions but increase the effort necessary to get the hash value. Some more ideas:
same as above but with second-order in- and outdegree, i.e. the number of nodes that can be reached by a node->child->child chain (= second-order outdegree), or respectively the number of nodes that lead to the given node in two steps.
or more general n-th order in- and outdegree (can be computed in O((average-number-of-connections) ^ (n-1) * #nodes) )
number of nodes with eccentricity = x (again for any x)
if the nodes store any information (other than their neighbours), use an xor of any kind of hash of all the node contents. Due to the xor, the specific order in which the nodes were added to the hash does not matter.
You requested "a unique hash value" and clearly I cannot offer you one. But I see the terms "hash" and "unique to every graph" as mutually exclusive (not entirely true of course) and decided to answer the "hash" part and not the "unique" part. A "unique hash" (perfect hash) basically needs to be a full serialization of the graph (because the amount of information stored in the hash has to reflect the total amount of information in the graph). If that is really what you want, just define some unique order of nodes (e.g. sorted by own outdegree, then indegree, then outdegree of children and so on until the order is unambiguous) and serialize the graph in any way (using the position in the aforementioned ordering as index to the nodes).
Of course this is much more complex though.

Years ago, I created a simple and flexible algorithm for exactly this problem (finding duplicate structures in a database of chemical structures by hashing them).
I named it "Powerhash", and creating the algorithm required two insights. The first is the power iteration graph algorithm, also used in PageRank. The second is the ability to replace power iteration's inner step function with anything we want. I replaced it with a function that does the following on each step, for each node:
Sort the hashes of the node's neighbors
Hash the concatenated sorted hashes
On the first step, a node's hash is affected by its direct neighbors. On the second step, a node's hash is affected by the neighborhood 2-hops away from it. On the Nth step a node's hash will be affected by the neighborhood N-hops around it. So you only need to continue running the Powerhash for N = graph_radius steps. In the end, the graph center node's hash will have been affected by the whole graph.
To produce the final hash, sort the final step's node hashes and concatenate them together. After that, you can compare the final hashes to find if two graphs are isomorphic. If you have labels, then add them in the internal hashes that you calculate for each node (and at each step).
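Here is a minimal Python sketch of the idea (my own illustration, not the madIS implementation linked below), assuming the graph is an adjacency dict mapping each node to its set of neighbours:
import hashlib

def _h(s):
    return hashlib.sha256(s.encode()).hexdigest()

def powerhash(adj, steps=None):
    if steps is None:
        steps = len(adj)          # number of nodes is always >= the graph radius
    hashes = {v: _h('') for v in adj}
    for _ in range(steps):
        # each node's new hash is the hash of its neighbours' sorted hashes
        hashes = {v: _h(','.join(sorted(hashes[w] for w in adj[v])))
                  for v in adj}
    # final graph hash: hash of the sorted multiset of node hashes
    return _h(','.join(sorted(hashes.values())))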
For more on this you can look at my post here:
https://plus.google.com/114866592715069940152/posts/fmBFhjhQcZF
The algorithm above was implemented inside the "madIS" functional relational database. You can find the source code of the algorithm here:
https://github.com/madgik/madis/blob/master/src/functions/aggregate/graph.py

Imho, if the graph can be topologically sorted, a very straightforward solution exists.
For each vertex with index i, you could build a unique hash (for example, using a hashing technique for strings) of its (sorted) direct neighbours (e.g. if vertex 1 has direct neighbours {43, 23, 2, 7, 12, 19, 334}, the hash function should hash the array {2, 7, 12, 19, 23, 43, 334}).
For the whole DAG you could create a hash, as a hash of a string of hashes for each node: Hash(DAG) = Hash(vertex_1) U Hash(vertex_2) U ..... Hash(vertex_N);
I think the complexity of this procedure is around O(N*N) in the worst case. If the graph cannot be topologically sorted, the proposed approach is still applicable, but you need to order the vertices in a unique way (and this is the hard part). A minimal sketch is shown below.
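A minimal sketch of this per-vertex hashing (my own illustration), assuming the vertices have already been relabelled in a canonical (e.g. topological) order and the graph is an adjacency dict of vertex index -> set of neighbour indices:
import hashlib

def vertex_hash(neighbours):
    # hash of the sorted list of direct neighbours
    return hashlib.sha256(repr(sorted(neighbours)).encode()).hexdigest()

def dag_hash(adj):
    # Hash(DAG) = hash of the concatenated per-vertex hashes, in vertex order
    combined = ''.join(vertex_hash(adj[v]) for v in sorted(adj))
    return hashlib.sha256(combined.encode()).hexdigest()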

I will describe an algorithm to hash an arbitrary directed graph, not taking into account that the graph is acyclic. In fact even counting the acyclic graphs of a given order is a very complicated task and I believe here this will only make the hashing significantly more complicated and thus slower.
A unique representation of the graph can be given by the neighbourhood list. For each vertex create a list with all its neighbours. Write all the lists one after the other, appending the number of neighbours for each list to the front. Also keep the neighbours sorted in ascending order to make the representation unique for each graph. So for example assume you have the graph:
1->2, 1->5
2->1, 2->4
3->4
5->3
What I propose is that you transform this to ({2,2,5}, {2,1,4}, {1,4}, {0}, {1,3}), where the curly brackets serve only to visualize the representation and are not part of Python's syntax. So the list is in fact: (2,2,5, 2,1,4, 1,4, 0, 1,3).
Now to compute the unique hash, you need to order these representations somehow and assign a unique number to them. I suggest you do something like a lexicographical sort to do that. Let's assume you have two sequences (a_1, b_1_1, b_1_2, ..., b_1_a1, a_2, b_2_1, b_2_2, ..., b_2_a2, ..., a_n, b_n_1, b_n_2, ..., b_n_an) and (c_1, d_1_1, d_1_2, ..., d_1_c1, c_2, d_2_1, d_2_2, ..., d_2_c2, ..., c_n, d_n_1, d_n_2, ..., d_n_cn). Here the a's and c's are the numbers of neighbours for each vertex, and b_i_j and d_k_l are the corresponding neighbours. For the ordering, first compare the sequences (a_1, a_2, ..., a_n) and (c_1, c_2, ..., c_n) and if they are different, use this to compare the sequences. If these sequences are equal, compare the lists from left to right, first comparing lexicographically (b_1_1, b_1_2, ..., b_1_a1) to (d_1_1, d_1_2, ..., d_1_c1), and so on until the first mismatch.
In fact what I propose is to use as the hash the lexicographical number of a word of size N over the alphabet formed by all possible subsets of elements of {1,2,3,...,N}. The neighbourhood list for a given vertex is a letter over this alphabet, e.g. {2,2,5} is the subset consisting of two elements of the set, namely 2 and 5.
The alphabet(set of possible letters) for the set {1,2,3} would be(ordered lexicographically):
{0}, {1,1}, {1,2}, {1,3}, {2, 1, 2}, {2, 1, 3}, {2, 2, 3}, {3, 1, 2, 3}
The first number, as above, is the number of elements in the given subset and the remaining numbers are the subset itself. So form all the 3-letter words from this alphabet and you will get all the possible directed graphs with 3 vertices.
Now the number of subsets of the set {1,2,3,...,N} is 2^N and thus the number of letters of this alphabet is 2^N. We code each directed graph of N nodes with a word of exactly N letters from this alphabet, and thus the number of possible hash codes is precisely (2^N)^N. This shows that the hash code grows really fast as N increases. It is also the number of possible different directed graphs with N nodes, so what I suggest is optimal hashing in the sense that it is a bijection and no smaller hash can be unique.
There is a linear algorithm to get a given subset's number in the lexicographical ordering of all subsets of a given set, in this case {1,2,...,N}. Here is the code I have written for coding/decoding a subset to a number and vice versa. It is written in C++ but quite easy to understand, I hope. For the hashing you will need only the code function, but as the hash I propose is reversible I add the decode function - you will be able to reconstruct the graph from the hash, which is quite cool I think:
#include <vector>
#include <algorithm>
using namespace std;

typedef long long ll;

// Returns the number in the lexicographical order of all combinations of n numbers
// of the provided combination.
ll code(vector<int> a, int n)
{
    sort(a.begin(), a.end()); // not needed if the set you pass is already sorted.
    int cur = 0;
    int m = a.size();
    ll res = 0;
    for (int i = 0; i < a.size(); i++)
    {
        if (a[i] == cur + 1)
        {
            res++;
            cur = a[i];
            continue;
        }
        else
        {
            res++;
            int number_of_greater_nums = n - a[i];
            for (int j = a[i] - 1, increment = 1; j > cur; j--, increment++)
                res += 1LL << (number_of_greater_nums + increment);
            cur = a[i];
        }
    }
    return res;
}

// Takes the lexicographical code of a combination of n numbers and returns the
// combination
vector<int> decode(ll kod, int n)
{
    vector<int> res;
    int cur = 0;
    int left = n; // Out of how many numbers are we left to choose.
    while (kod)
    {
        ll all = 1LL << left; // how many are the total combinations
        for (int i = n; i >= 0; i--)
        {
            if (all - (1LL << (n - i + 1)) + 1 <= kod)
            {
                res.push_back(i);
                left = n - i;
                kod -= all - (1LL << (n - i + 1)) + 1;
                break;
            }
        }
    }
    return res;
}
Also, this code stores the result in a long long variable, which is only enough for graphs with fewer than 64 elements. All possible hashes of graphs with 64 nodes will be (2^64)^64. This number has more than 1,200 digits, so it is quite a big number. Still, the algorithm I describe will work really fast and I believe you should be able to hash and 'unhash' graphs with a lot of vertices.
Also have a look at this question.

I'm not sure that it's 100% working, but here is an idea:
Let's code a graph into a string and then take its hash.
hash of an empty graph is ""
hash of a vertex with no outgoing edges is "."
hash of a vertex with outgoing edges is concatenation of every child hash with some delimiter (e.g. ",")
To produce the same hash for isomorphic graphs, before the concatenation in step 3 just sort the hashes (e.g. in lexicographical order).
For hash of a graph just take hash of its root (or sorted concatenation, if there are several roots).
edit: While I hoped that the resulting string would describe the graph without collisions, hynekcer found that sometimes non-isomorphic graphs get the same hash. That happens when a vertex has several parents - it is then "duplicated" for every parent. For example, the algorithm does not differentiate the "diamond" {A->B->C, A->D->C} from the case {A->B->C, A->D->E}.
I'm not familiar with Python and it's hard for me to understand how the graph is stored in the example, but here is some code in C++ which is likely easy to convert to Python:
#include <string>
#include <vector>
#include <algorithm>

THash GetHash(const TGraph &graph)
{
    return ComputeHash(GetVertexStringCode(graph, FindRoot(graph)));
}

std::string GetVertexStringCode(const TGraph &graph, TVertexIndex vertex)
{
    std::vector<std::string> childHashes;
    for (auto c : graph.GetChildren(vertex))   // children are assumed to be vertex indices
        childHashes.push_back(GetVertexStringCode(graph, c));
    std::sort(childHashes.begin(), childHashes.end());
    std::string result = ".";
    for (const auto &h : childHashes)
        result += h + ",";
    return result;
}

I am assuming there are no common labels on vertices or edges, for then you could put the graph in a canonical form, which itself would be a perfect hash. This proposal is therefore based on isomorphism only.
For this, combine hashes for as many simple aggregate characteristics of a DAG as you can imagine, picking those that are quick to compute. Here is a starter list:
2d histogram of nodes' in and out degrees.
4d histogram of edges a->b where a and b are both characterized by in/out degree.
Addition
Let me be more explicit. For 1, we'd compute a set of triples <I,O;N> (where no two triples have the same I,O values), signifying that there are N nodes with in-degree I and out-degree O. You'd hash this set of triples or better yet use the whole set arranged in some canonical order e.g. lexicographically sorted. For 2, we compute a set of quintuples <aI,aO,bI,bO;N> signifying that there are N edges from nodes with in degree aI and out degree aO, to nodes with bI and bO respectively. Again hash these quintuples or else use them in canonical order as-is for another part of the final hash.
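For illustration, a small sketch computing both histograms for a DAG given as an adjacency dict of node -> set of successors (my own code, not the answerer's):
import hashlib
from collections import Counter

def degree_histograms(adj):
    out_deg = {v: len(adj[v]) for v in adj}
    in_deg = Counter(c for children in adj.values() for c in children)
    # 1) triples <I, O; N>: N nodes with in-degree I and out-degree O
    node_hist = sorted(Counter((in_deg[v], out_deg[v]) for v in adj).items())
    # 2) quintuples <aI, aO, bI, bO; N>: N edges between nodes of those degrees
    edge_hist = sorted(Counter(
        (in_deg[a], out_deg[a], in_deg[b], out_deg[b])
        for a in adj for b in adj[a]).items())
    return node_hist, edge_hist

def aggregate_hash(adj):
    # hash the two canonically ordered histograms together
    return hashlib.sha256(repr(degree_histograms(adj)).encode()).hexdigest()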
Starting with this and then looking at collisions that still occur will probably provide insights on how to get better.

When I saw the question, I had essentially the same idea as @example. I wrote a function providing a graph tag such that the tag coincides for two isomorphic graphs.
This tag consists of the sequence of out-degrees in ascending order. You can hash this tag with the string hash function of your choice to obtain a hash of the graph.
Edit: I expressed my proposal in the context of @NeilG's original question. The only modification to make to his code is to redefine the hashkey function as:
def hashkey(self):
    return tuple(sorted(map(len, self.lt.values())))

With suitable ordering of your descendants (and if you have a single root node - not a given, but achievable with suitable ordering, maybe by including a virtual root node), the method for hashing a tree ought to work with a slight modification.
Example code is in this StackOverflow answer; the modification would be to sort children in some deterministic order (increasing hash?) before hashing the parent.
Even if you have multiple possible roots, you can create a synthetic single root, with all roots as children.
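A minimal sketch of this tree-style hashing with a synthetic root (my own illustration; as noted in an earlier answer, shared sub-DAGs get duplicated, so "diamonds" can collide with trees):
import hashlib

def dag_hash(adj):
    """adj: dict mapping every node to its set of children."""
    roots = set(adj) - {c for children in adj.values() for c in children}

    def node_hash(v):
        child_hashes = sorted(node_hash(c) for c in adj.get(v, ()))
        return hashlib.sha256(','.join(child_hashes).encode()).hexdigest()

    # synthetic single root whose children are all the real roots
    top = ','.join(sorted(node_hash(r) for r in roots))
    return hashlib.sha256(top.encode()).hexdigest()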

Related

Algorithm to calculate the minimum swaps required to sort an array with duplicated elements in a given range?

Is there an efficient way to calculate the optimal swaps required to sort an array? The elements of the array can be duplicated, and there is a given upper limit of 3 (the elements are in {1, 2, 3}).
For example:
1311212323 -> 1111222333 (#swaps: 2)
I already found similar questions on Stack Overflow; however, here we have extra information about the upper limit that can be useful in the algorithm.
Yes, the upper limit of 3 makes a big difference.
Let w(i, j) be the number of positions that contain i that should contain j. To find the optimal number of swaps, let w'(i, j) = w(i, j) - min(w(i, j), w(j, i)). The answer is (sum over i<j of min(w(i, j), w(j, i))) + (2/3) (sum over i!=j of w'(i, j)).
That this answer is an upper bound follows from the following greedy algorithm: if there are i!=j such that w(i, j) > 0 and w(j, i) > 0, then we can swap an appropriate i and j, costing us one swap but also lowering the bound by one. Otherwise, swap any two out of place elements. The first term of the answer goes up by one, and the second goes down by two. (I am implicitly invoking induction here.)
That this answer is a lower bound follows from the fact that no swap can decrease it by more than one. This follows from more tedious case analysis.
The reason that this answer doesn't generalize past (much past?) 3 is that the cycle structure gets more complicated. Still, for array entries bounded by k, there should be an algorithm whose exponential dependence is limited to k, with a polynomial dependence on n, the length of the arrays.
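A small Python sketch of this counting argument (my own illustration; it assumes the target arrangement is simply the sorted array):
from collections import Counter

def min_swaps(arr):
    target = sorted(arr)
    w = Counter()                         # w[(i, j)]: positions holding i that should hold j
    for have, want in zip(arr, target):
        if have != want:
            w[(have, want)] += 1
    values = sorted(set(target))
    swaps = 0
    # resolve 2-cycles first: one swap fixes two positions
    for a in values:
        for b in values:
            if a < b:
                pair = min(w[(a, b)], w[(b, a)])
                swaps += pair
                w[(a, b)] -= pair
                w[(b, a)] -= pair
    # what remains (the w' of the answer) consists of 3-cycles:
    # each needs two swaps to fix three positions
    remaining = sum(w.values())
    return swaps + 2 * remaining // 3

print(min_swaps([1, 3, 1, 1, 2, 1, 2, 3, 2, 3]))  # 2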

Does this shuffling algorithm produce each permutation with uniform probability?

I've seen how a particular naive shuffling algorithm is biased, and I feel like I basically get that, and I get how the Fisher-Yates algorithm is not biased. I have the following algorithm, which was the one I first thought of when I thought about how to shuffle a list. I know it consumes twice the memory and runs in unnecessarily large time, but I'm still curious if it produces each permutation with a uniform distribution, or if there's some sneaky reason I'm not seeing for it to be biased.
I'm also kind of wondering if there is some other "undesirable" property to a random shuffle that this would have, like perhaps the probabilities of various positions in the list being filled with some values are dependent.
import random as rand

def shuf(x):
    out = [None for i in range(len(x))]
    for i in x:
        pos = rand.randint(0, len(x) - 1)
        while out[pos] != None:
            pos = rand.randint(0, len(x) - 1)
        out[pos] = i
    return out
I generated a heat map of this on a list of 20 elements, running 10^6 trials. The (i, j) coordinate of the map represents the probability of the ith position of the list being filled with the jth element of the original list.
While I don't see any pattern to the heat map, it looks like the variance might be high. Or that might be the heat map over-stating the variance because, hey, the minimum and max have to come up somewhere.
Undesirable property - this can be expensive if you're shuffling a large set:
while out[pos] != None:
    pos = rand.randint(0, len(x) - 1)
Imagine len(x) == 100,000,000 and you've placed 90,000,000 already - you're going to loop a LOT before you get a hit.
Interesting exercises:
What does the heat map look like for simply generating random numbers between 1 and len(x) over 10^6 iterations?
What does the heat map look like for Fisher-Yates, for comparison?
At a glance, it looks to me like, given a uniform RNG, it should yield a truly random distribution (albeit more slowly than Fisher-Yates).
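For reference, here is a minimal Fisher-Yates shuffle to compare against (essentially what random.shuffle already does); it fills each position in a single pass with no retry loop:
import random as rand

def fisher_yates(x):
    out = list(x)
    for i in range(len(out) - 1, 0, -1):
        j = rand.randint(0, i)          # randint is inclusive on both ends
        out[i], out[j] = out[j], out[i]
    return out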

Python: Randomly draw several objects in a list

I am looking for the most efficient way to randomly draw n elements from a list, given a list of probabilities stating the probability of each element being picked.
aList = [3,4,2,1,4,3,5,7,6,4]
MyProba = [0.1,0.1,0.2,0,0.1,0,0.2,0,0.2,0.1]
It means that at each draw, the first element (which is 3) has a probability of 0.1 to be drawn. Of course,
sum(MyProba) == 1 # always returns True
len(aList) == len(MyProba) # always returns True
Up to now I did the following:
import random

def random_pick(some_list, proba):
    x = random.uniform(0, 1)
    cumulative_proba = 0.0
    for item, item_proba in zip(some_list, proba):
        cumulative_proba += item_proba
        if x < cumulative_proba:
            break
    return item

nb_draws = 10
list_of_drawn_elements = []
for one_draw in range(nb_draws):
    list_of_drawn_elements.append(random_pick(aList, MyProba))
It works but it is terribly slow for long lists and big values of nb_draws. How can I improve the speed of this process?
Note: In the special case I am facing, nb_draws always equals the length of aList.
The general idea (as outlined by others' answers as well) is that your method is inefficient because the preprocessing (the calculation of the cumulative distribution) is done every time you draw a sample, although it would be enough to do it once before the sampling and then use the preprocessed data to do the sampling.
The preprocessing and sampling could be done efficiently with Walker's alias method. I have implemented it a while ago; take a look at the source code. (Sorry for the external link, but I think it's too long to post it here). My version requires NumPy; if you don't want to use NumPy, there is a NumPy-free alternative as well (on which my version is based).
Edit: the explanation of Walker's alias method is to be found in the first link I provided. In a nutshell, imagine that you somehow managed to construct a rectangular "darts board" that is subdivided into parts such that each part corresponds to one of your original items, and the area of each part is proportional to the desired probability of selecting the corresponding element. You can then start throwing darts at random at the darts board (by generating two random numbers that specify the horizontal and vertical coordinate of where the dart ended up) and check which areas the darts hit. The items corresponding to the areas will be the items you have selected. Walker's alias method is simply a linear-time preprocessing that constructs the dart board. Drawing each element can then be done in constant time. In the end, drawing m elements out of n will have a cost of O(n) for preprocessing and O(m) for generating the samples, yielding a total complexity of O(n + m).
here's my lazy method... build a list with expected number of values for the desired distribution, and use random.choice() to pick a value from the list.
>>> import random
>>>
>>> value_probs = dict(zip([3,4,2,1,4,3,5,7,6,4], [0.1,0.1,0.2,0,0.1,0,0.2,0,0.2,0.1]))
>>> expected_dist = sum([[i] * int(prob * 100) for i, prob in value_probs.items()], [])
>>> random.choice(expected_dist)
You might try to precalculate the cumulative probability range for each element and make a tree from these intervals. Then you will get a logarithmic complexity for looking up the element corresponding to the generated probability, instead of linear one that you have now.
You're calculating cumulative_proba each time you call random_pick. I suggest calculating it outside the method and using a better data structure to store it, like a binary search tree, which will reduce the time complexity from O(n) to O(log n).
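A minimal sketch of that precomputation idea (my own illustration, assuming the probabilities do not change between draws; the helper name is made up):
import bisect
import random
from itertools import accumulate

def make_sampler(some_list, proba):
    cumulative = list(accumulate(proba))          # built once, reused for every draw
    def pick():
        x = random.random() * cumulative[-1]      # x in [0, total)
        return some_list[bisect.bisect_right(cumulative, x)]
    return pick

pick = make_sampler(aList, MyProba)               # aList, MyProba as defined in the question
list_of_drawn_elements = [pick() for _ in range(nb_draws)]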

Finding components of very large graph

I have a very large graph represented in a text file of size about 1TB with each edge as follows.
From-node to-node
I would like to split it into its weakly connected components. If it was smaller I could load it into networkx and run their component finding algorithms. For example
http://networkx.github.io/documentation/latest/reference/generated/networkx.algorithms.components.connected.connected_components.html#networkx.algorithms.components.connected.connected_components
Is there any way to do this without loading the whole thing into memory?
If you have few enough nodes (e.g. a few hundred million), then you could compute the connected components with a single pass through the text file by using a disjoint set forest stored in memory.
This data structure only stores the rank and parent pointer for each node so should fit in memory if you have few enough nodes.
For larger number of nodes, you could try the same idea, but storing the data structure on disk (and possibly improved by using a cache in memory to store frequently used items).
Here is some Python code that implements a simple in-memory version of disjoint set forests:
N = 7  # Number of nodes
rank = [0] * N
parent = list(range(N))

def Find(x):
    """Find representative of connected component"""
    if parent[x] != x:
        parent[x] = Find(parent[x])
    return parent[x]

def Union(x, y):
    """Merge sets containing elements x and y"""
    x = Find(x)
    y = Find(y)
    if x == y:
        return
    if rank[x] < rank[y]:
        parent[x] = y
    elif rank[x] > rank[y]:
        parent[y] = x
    else:
        parent[y] = x
        rank[x] += 1

with open("disjointset.txt", "r") as fd:
    for line in fd:
        fr, to = map(int, line.split())
        Union(fr, to)

for n in range(N):
    print(n, 'is in component', Find(n))
If you apply it to the text file called disjointset.txt containing:
1 2
3 4
4 5
0 5
it prints
0 is in component 3
1 is in component 1
2 is in component 1
3 is in component 3
4 is in component 3
5 is in component 3
6 is in component 6
You could save memory by not using the rank array, at the cost of potentially increased computation time.
If even the number of nodes is too large to fit in memory, you can divide and conquer and use external memory sorts to do most of the work for you (e.g. the sort command included with Windows and Unix can sort files much larger than memory):
Choose some threshold vertex k.
Read the original file and write each of its edges to one of 3 files:
To a if its maximum-numbered vertex is < k
To b if its minimum-numbered vertex is >= k
To c otherwise (i.e. if it has one vertex < k and one vertex >= k)
If a is small enough to solve (find connected components for) in memory (using e.g. Peter de Rivaz's algorithm) then do so, otherwise recurse to solve it. The solution should be a file whose lines each consist of two numbers x y and which is sorted by x. Each x is a vertex number and y is its representative -- the lowest-numbered vertex in the same component as x.
Do likewise for b.
Sort edges in c by their smallest-numbered endpoint.
Go through each edge in c, renaming the endpoint that is < k (remember, there must be exactly one such endpoint) to its representative, found from the solution to the subproblem a. This can be done efficiently by using a linear-time merge algorithm to merge with the solution to the subproblem a. Call the resulting file d.
Sort edges in d by their largest-numbered endpoint. (The fact that we have already renamed the smallest-numbered endpoint doesn't make this unsafe, since renaming can never increase a vertex's number.)
Go through each edge in d, renaming the endpoint that is >= k to its representative, found from the solution to the subproblem b using a linear-time merge as before. Call the resulting file e.
Solve e. (As with a and b, do this directly in memory if possible, otherwise recurse. If you need to recurse, you will need to find a different way of partitioning the edges, since all the edges in e already "straddle" k. You could for example renumber vertices using a random permutation of vertex numbers, recurse to solve the resulting problem, then rename them back.) This step is necessary because there could be an edge (1, k), another edge (2, k+1) and a third edge (2, k), and this will mean that all vertices in the components 1, 2, k and k+1 need to be combined into a single component.
Go through each line in the solution for subproblem a, updating the representative for this vertex using the solution to subproblem e if necessary. This can be done efficiently using a linear-time merge. Write out the new list of representatives (which will already be sorted by vertex number due to the fact that we created it from a's solution) to a file f.
Do likewise for each line in the solution for subproblem b, creating file g.
Concatenate f and g to produce the final answer. (For better efficiency, just have step 11 append its results directly to f).
All the linear-time merge operations used above can read directly from disk files, since they only ever access items from each list in increasing order (i.e. no slow random access is needed).
External memory graph traversal is tricky to get performant. I advise against writing your own code, implementation details make the difference between a runtime of a few hours and a runtime of a few months. You should consider using existing libraries like the stxxl. See here for a paper using it to compute connected components.

Weighted random selection with and without replacement

Recently I needed to do weighted random selection of elements from a list, both with and without replacement. While there are well known and good algorithms for unweighted selection, and some for weighted selection without replacement (such as modifications of the reservoir algorithm), I couldn't find any good algorithms for weighted selection with replacement. I also wanted to avoid the reservoir method, as I was selecting a significant fraction of the list, which is small enough to hold in memory.
Does anyone have any suggestions on the best approach in this situation? I have my own solutions, but I'm hoping to find something more efficient, simpler, or both.
One of the fastest ways to make many with-replacement samples from an unchanging list is the alias method. The core intuition is that we can create a set of equal-sized bins for the weighted list that can be indexed very efficiently through bit operations, to avoid a binary search. It will turn out that, done correctly, we will need to store only two items from the original list per bin, and thus can represent the split with a single percentage.
Let us take the example of five equally weighted choices, (a:1, b:1, c:1, d:1, e:1).
To create the alias lookup:
Normalize the weights such that they sum to 1.0. (a:0.2 b:0.2 c:0.2 d:0.2 e:0.2) This is the probability of choosing each weight.
Find the smallest power of 2 greater than or equal to the number of variables, and create this number of partitions, |p|. Each partition represents a probability mass of 1/|p|. In this case, we create 8 partitions, each able to contain 0.125.
Take the variable with the least remaining weight, and place as much of its mass as possible in an empty partition. In this example, we see that a fills the first partition. (p1{a|null,1.0},p2,p3,p4,p5,p6,p7,p8) with (a:0.075, b:0.2 c:0.2 d:0.2 e:0.2)
If the partition is not filled, take the variable with the most weight, and fill the partition with that variable.
Repeat steps 3 and 4 until all of the weight from the original distribution has been assigned to a partition.
For example, if we run another iteration of 3 and 4, we see
(p1{a|null,1.0},p2{a|b,0.6},p3,p4,p5,p6,p7,p8) with (a:0, b:0.15 c:0.2 d:0.2 e:0.2) left to be assigned
At runtime:
Get a U(0,1) random number, say binary 0.001100000
bit-shift it by lg2(|p|), finding the index of the partition. Thus, we shift it by 3, yielding 001.1, or position 1, and thus partition 2.
If the partition is split, use the decimal portion of the shifted random number to decide the split. In this case, the value is 0.5, and 0.5 < 0.6, so return a.
Here is some code and another explanation, but unfortunately it doesn't use the bitshifting technique, nor have I actually verified it.
A simple approach that hasn't been mentioned here is one proposed in Efraimidis and Spirakis. In python you could select m items from n >= m weighted items with strictly positive weights stored in weights, returning the selected indices, with:
import heapq
import math
import random
def WeightedSelectionWithoutReplacement(weights, m):
    elt = [(math.log(random.random()) / weights[i], i) for i in range(len(weights))]
    return [x[1] for x in heapq.nlargest(m, elt)]
This is very similar in structure to the first approach proposed by Nick Johnson. Unfortunately, that approach is biased in selecting the elements (see the comments on the method). Efraimidis and Spirakis proved that their approach is equivalent to random sampling without replacement in the linked paper.
Here's what I came up with for weighted selection without replacement:
import random

def WeightedSelectionWithoutReplacement(l, n):
    """Selects without replacement n random elements from a list of (weight, item) tuples."""
    l = sorted((random.random() * x[0], x[1]) for x in l)
    return l[-n:]
This is O(m log m) on the number of items in the list to be selected from. I'm fairly certain this will weight items correctly, though I haven't verified it in any formal sense.
Here's what I came up with for weighted selection with replacement:
import bisect
import random

def WeightedSelectionWithReplacement(l, n):
    """Selects with replacement n random elements from a list of (weight, item) tuples."""
    cum_weights = []   # parallel lists so bisect can search the float cumulative weights
    items = []
    total_weight = 0.0
    for weight, item in l:
        total_weight += weight
        cum_weights.append(total_weight)
        items.append(item)
    return [items[bisect.bisect(cum_weights, random.random() * total_weight)]
            for x in range(n)]
This is O(m + n log m), where m is the number of items in the input list, and n is the number of items to be selected.
I'd recommend you start by looking at section 3.4.2 of Donald Knuth's Seminumerical Algorithms.
If your arrays are large, there are more efficient algorithms in chapter 3 of Principles of Random Variate Generation by John Dagpunar. If your arrays are not terribly large or you're not concerned with squeezing out as much efficiency as possible, the simpler algorithms in Knuth are probably fine.
It is possible to do Weighted Random Selection with replacement in O(1) time, after first creating an additional O(N)-sized data structure in O(N) time. The algorithm is based on the Alias Method developed by Walker and Vose, which is well described here.
The essential idea is that each bin in a histogram would be chosen with probability 1/N by a uniform RNG. So we will walk through it, and for any underpopulated bin which would receive excess hits, assign the excess to an overpopulated bin. For each bin, we store the percentage of hits which belong to it, and the partner bin for the excess. This version tracks small and large bins in place, removing the need for an additional stack. It uses the index of the partner (stored in bucket[1]) as an indicator that they have already been processed.
Here is a minimal python implementation, based on the C implementation here
import random

def prep(weights):
    data_sz = len(weights)
    factor = data_sz / float(sum(weights))
    data = [[w * factor, i] for i, w in enumerate(weights)]
    big = 0
    while big < data_sz and data[big][0] <= 1.0:
        big += 1
    for small, bucket in enumerate(data):
        if bucket[1] != small:
            continue
        excess = 1.0 - bucket[0]
        while excess > 0:
            if big == data_sz:
                break
            bucket[1] = big
            bucket = data[big]
            bucket[0] -= excess
            excess = 1.0 - bucket[0]
            if (excess >= 0):
                big += 1
                while big < data_sz and data[big][0] <= 1:
                    big += 1
    return data

def sample(data):
    r = random.random() * len(data)
    idx = int(r)
    return data[idx][1] if r - idx > data[idx][0] else idx
Example usage:
TRIALS = 1000
weights = [20, 1.5, 9.8, 10, 15, 10, 15.5, 10, 8, .2]
samples = [0] * len(weights)
data = prep(weights)
for _ in range(int(sum(weights) * TRIALS)):
    samples[sample(data)] += 1
result = [float(s) / TRIALS for s in samples]
err = [a - b for a, b in zip(result, weights)]
print(result)
print([round(e, 5) for e in err])
print(sum([e * e for e in err]))
The following is a description of random weighted selection of an element of a
set (or multiset, if repeats are allowed), both with and without replacement in O(n) space
and O(log n) time.
It consists of implementing a binary search tree, sorted by the elements to be
selected, where each node of the tree contains:
the element itself (element)
the un-normalized weight of the element (elementweight), and
the sum of all the un-normalized weights of the left-child node and all of
its children (leftbranchweight).
the sum of all the un-normalized weights of the right-child node and all of
its children (rightbranchweight).
Then we randomly select an element from the BST by descending down the tree. A
rough description of the algorithm follows. The algorithm is given a node of
the tree. Then the values of leftbranchweight, rightbranchweight,
and elementweight of the node are summed, and the weights are divided by this
sum, resulting in the values leftbranchprobability,
rightbranchprobability, and elementprobability, respectively. Then a
random number between 0 and 1 (randomnumber) is obtained.
if the number is less than elementprobability,
remove the element from the BST as normal, updating leftbranchweight
and rightbranchweight of all the necessary nodes, and return the
element.
else if the number is less than (elementprobability + leftbranchprobability)
recurse on leftchild (run the algorithm using leftchild as node)
else
recurse on rightchild
When we finally find, using these weights, which element is to be returned, we either simply return it (with replacement) or we remove it and update relevant weights in the tree (without replacement).
DISCLAIMER: The algorithm is rough, and a treatise on the proper implementation
of a BST is not attempted here; rather, it is hoped that this answer will help
those who really need fast weighted selection without replacement (like I do).
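For illustration, here is a compact sketch of the same O(log n) select-and-remove idea using a flat segment tree indexed by item position, rather than a BST keyed on the elements (my own sketch, not the structure described above; names are illustrative):
import random

class WeightedSampler:
    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (2 * self.n)
        self.tree[self.n:] = list(map(float, weights))   # leaves hold the weights
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def sample(self, replace=True):
        r = random.random() * self.tree[1]   # tree[1] is the total remaining weight
        i = 1
        while i < self.n:                    # descend toward a leaf
            i *= 2
            if r >= self.tree[i]:
                r -= self.tree[i]
                i += 1
        index = i - self.n
        if not replace:                      # remove: zero the leaf, fix ancestor sums
            delta = -self.tree[i]
            while i:
                self.tree[i] += delta
                i //= 2
        return index

ws = WeightedSampler([0.4, 0.1, 0.5])
first = ws.sample(replace=False)             # removes the chosen item's weight
second = ws.sample(replace=False)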
This is an old question for which numpy now offers an easy solution, so I thought I would mention it. numpy.random.choice (available since numpy 1.7) allows the sampling to be done with or without replacement and with given weights.
Suppose you want to sample 3 elements without replacement from the list ['white','blue','black','yellow','green'] with a prob. distribution [0.1, 0.2, 0.4, 0.1, 0.2]. Using numpy.random module it is as easy as this:
import numpy.random as rnd
sampling_size = 3
domain = ['white','blue','black','yellow','green']
probs = [.1, .2, .4, .1, .2]
sample = rnd.choice(domain, size=sampling_size, replace=False, p=probs)
# in short: rnd.choice(domain, sampling_size, False, probs)
print(sample)
# Possible output: ['white' 'black' 'blue']
Setting the replace flag to True, you have a sampling with replacement.
More info here:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html#numpy.random.choice
We faced the problem of randomly selecting K validators out of N candidates once per epoch, proportionally to their stakes. But this gives us the following problem:
Imagine probabilities of each candidate:
0.1
0.1
0.8
The probabilities of each candidate after 1,000,000 selections of 2 out of 3 without replacement became:
0.254315
0.256755
0.488930
You should know that those original probabilities are not achievable for a 2-of-3 selection without replacement.
But we want the initial probabilities to be the profit distribution probabilities; otherwise it makes small candidate pools more profitable. So we realized that random selection with replacement would help us: randomly select >K of N and also store the weight of each validator for reward distribution:
// n, m, likehoods and likehoodsSum are assumed to be defined elsewhere
std::vector<int> validators;
std::vector<int> weights(n);
int totalWeights = 0;
for (int j = 0; validators.size() < m; j++) {
    int value = rand() % likehoodsSum;
    for (int i = 0; i < n; i++) {
        if (value < likehoods[i]) {
            if (weights[i] == 0) {
                validators.push_back(i);
            }
            weights[i]++;
            totalWeights++;
            break;
        }
        value -= likehoods[i];
    }
}
Over millions of samples, it reproduces the original distribution of rewards almost exactly:
0.101230
0.099113
0.799657
