Finding components of very large graph - python

I have a very large graph represented in a text file of size about 1TB with each edge as follows.
From-node to-node
I would like to split it into its weakly connected components. If it was smaller I could load it into networkx and run their component finding algorithms. For example
http://networkx.github.io/documentation/latest/reference/generated/networkx.algorithms.components.connected.connected_components.html#networkx.algorithms.components.connected.connected_components
Is there any way to do this without loading the whole thing into memory?

If you have few enough nodes (e.g. a few hundred million), then you could compute the connected components with a single pass through the text file by using a disjoint set forest stored in memory.
This data structure only stores the rank and parent pointer for each node so should fit in memory if you have few enough nodes.
For larger number of nodes, you could try the same idea, but storing the data structure on disk (and possibly improved by using a cache in memory to store frequently used items).
Here is some Python code that implements a simple in-memory version of disjoint set forests:
N=7 # Number of nodes
rank=[0]*N
parent=range(N)
def Find(x):
"""Find representative of connected component"""
if parent[x] != x:
parent[x] = Find(parent[x])
return parent[x]
def Union(x,y):
"""Merge sets containing elements x and y"""
x = Find(x)
y = Find(y)
if x == y:
return
if rank[x]<rank[y]:
parent[x] = y
elif rank[x]>rank[y]:
parent[y] = x
else:
parent[y] = x
rank[x] += 1
with open("disjointset.txt","r") as fd:
for line in fd:
fr,to = map(int,line.split())
Union(fr,to)
for n in range(N):
print n,'is in component',Find(n)
If you apply it to the text file called disjointset.txt containing:
1 2
3 4
4 5
0 5
it prints
0 is in component 3
1 is in component 1
2 is in component 1
3 is in component 3
4 is in component 3
5 is in component 3
6 is in component 6
You could save memory by not using the rank array, at the cost of potentially increased computation time.

If even the number of nodes is too large to fit in memory, you can divide and conquer and use external memory sorts to do most of the work for you (e.g. the sort command included with Windows and Unix can sort files much larger than memory):
Choose some threshold vertex k.
Read the original file and write each of its edges to one of 3 files:
To a if its maximum-numbered vertex is < k
To b if its minimum-numbered vertex is >= k
To c otherwise (i.e. if it has one vertex < k and one vertex >= k)
If a is small enough to solve (find connected components for) in memory (using e.g. Peter de Rivaz's algorithm) then do so, otherwise recurse to solve it. The solution should be a file whose lines each consist of two numbers x y and which is sorted by x. Each x is a vertex number and y is its representative -- the lowest-numbered vertex in the same component as x.
Do likewise for b.
Sort edges in c by their smallest-numbered endpoint.
Go through each edge in c, renaming the endpoint that is < k (remember, there must be exactly one such endpoint) to its representative, found from the solution to the subproblem a. This can be done efficiently by using a linear-time merge algorithm to merge with the solution to the subproblem a. Call the resulting file d.
Sort edges in d by their largest-numbered endpoint. (The fact that we have already renamed the smallest-numbered endpoint doesn't make this unsafe, since renaming can never increase a vertex's number.)
Go through each edge in d, renaming the endpoint that is >= k to its representative, found from the solution to the subproblem b using a linear-time merge as before. Call the resulting file e.
Solve e. (As with a and b, do this directly in memory if possible, otherwise recurse. If you need to recurse, you will need to find a different way of partitioning the edges, since all the edges in e already "straddle" k. You could for example renumber vertices using a random permutation of vertex numbers, recurse to solve the resulting problem, then rename them back.) This step is necessary because there could be an edge (1, k), another edge (2, k+1) and a third edge (2, k), and this will mean that all vertices in the components 1, 2, k and k+1 need to be combined into a single component.
Go through each line in the solution for subproblem a, updating the representative for this vertex using the solution to subproblem e if necessary. This can be done efficiently using a linear-time merge. Write out the new list of representatives (which will already be sorted by vertex number due to the fact that we created it from a's solution) to a file f.
Do likewise for each line in the solution for subproblem b, creating file g.
Concatenate f and g to produce the final answer. (For better efficiency, just have step 11 append its results directly to f).
All the linear-time merge operations used above can read directly from disk files, since they only ever access items from each list in increasing order (i.e. no slow random access is needed).

External memory graph traversal is tricky to get performant. I advise against writing your own code, implementation details make the difference between a runtime of a few hours and a runtime of a few months. You should consider using existing libraries like the stxxl. See here for a paper using it to compute connected components.

Related

What is the time efficient way to calculate Global efficiency for large networks?

I have a network with 30000 nodes and over 40000 edges. I tried to calculate Global efficiency for my network with networkx but it's not time efficient. I was wondering what is the best library to calculate Global efficiency for large networks like mine?
edit 29 Sept - corrected a typo where I had an indent that shouldn't be there
I looked at the networkx implementation and found an inefficiency (it considers each possible path independently, while there are ways to find many of the shortest paths all at once). I've improved the method.
Try this code:
def my_global_efficiency(G):
'''author Joel C Miller
https://stackoverflow.com/a/57032282/2966723
'''
n = len(G)
denom = n*(n-1)
if denom>0:
efficiency = 0
for path_collection in nx.all_pairs_shortest_path_length(G):
source = path_collection[0]
for target in path_collection[1]:
if target != source:
efficiency += 1./path_collection[1][target]
return efficiency/denom
else:
return 0
Sample use:
import networkx as nx
G = nx.fast_gnp_random_graph(500,0.04)
nx.global_efficiency(G)
#answers will vary based on G
> 0.44650033400070577
my_global_efficiency(G)
> 0.44650033400070543
The difference in the last 3 digits is a rounding issue. I think it is caused by some of the sums being done in a different order.
This will run significantly faster. However, it may not be enough of an improvement for your purposes.
An alternate improvement if your graph is undirected would be to go to the networkx code, replace denom by half of its value and change the permutations to combinations. Currently it looks at each pair of nodes and finds the distance in both directions. If it's undirected, you only need to do this once. So the change to combinations gives a factor of 2 improvement.
Depending on your graph it's not clear to me which change will be faster. And these may still be too slow for your purposes.
You can speed up the process a bit more by getting an approximate value. To do this, instead of using nx.all_pairs_shortest_path_length, sample a large number of randomly chosen sources and find the distances of each of those specific nodes from all of the other nodes in G using nx.single_source_shortest_path_length. So if you take N=100 sources then there will be denom=N*(n-1) paths considered where n is the total number of nodes in G. This should give over a factor of 300 speed up from the improved my_global_efficiency.

Find third coordinate of (right) triangle given 2 coordinates and ray to third

I start explaining my problem from very far, so you could suggest completely different approaches and understand custom objects and functions.
Over years I have recorded many bicycle GPS tracks (.gpx). I decided to merge these (mostly overlapping) tracks into a large graph and merge/remove most of track points. So far, I have managed to simplify tracks (feature in gpxpy module, that removes about 90% of track-points, while preserving positions of corners) and load them into my current program.
Current Python 3 program consists of loading gpx tracks and optimising graph with four scans. Here's planned steps in my program:
Import points from gpx (working)
Join points located close to each other (working)
Merge edges under small angles (Problem is with this step)
Remove points on straights (angle between both edges is over 170 degrees). Looks like it is working.
Clean-up by resetting unique indexing of points (working)
Final checking of all edges in graph.
In my program I started counting steps with 0, because first one is simply opening and parsing file. Stackoverflow doesn't let me to start ordering from 0.
To store graph, I have a dictionary punktid (points in estonian), where punkt (point) object is stored at key uid/ui (unique ID). Unique ID is also stored in point itself too. Weight attribute is used in 2-nd and 3-rd step to find average of points while taking into account earlier merges.
class punkt:
def __init__(self,lo,la,idd,edge=set(),ele=0, wei=1):
self.lng=lo #Longtitude
self.lat=la #Latitude
self.uid=idd #Unique ID
self.edges=edge #Set of neighbour nodes
self.att=ele #Elevation
self.weight=wei #Used to get weighted average
>>> punktid
{1: <__main__.punkt object at 0x0000006E9A9F7FD0>,
2: <__main__.punkt object at 0x0000006E9AADC470>, 3: ...}
>>> punktid[1].__dict__
{'weight': 90, 'uid': 9000, 'att': 21.09333333333333, 'lat': 59.41757, 'lng': 24.73907, 'edges': {1613, 1218, 1530}}
As you can see, there is a minor bug in clean-up, where uid was not updated. I have fixed it by now, but I left it in, so you could see scale of graph. Largest index in punktid was 1699/11787.
Getting to core problem
Let's say I have 3 points: A, B and C (i, lyhem(2) and lyhem(0) respectively in following code slice). A has common edge with B and C, but B and C might not have common edge. C is closer to A than B. To reduce size of graph, I want to move C closer to edge AB (while respecting weights of B and C) and redirect AB through C.
Solution I came up with is to find temporary point D on AB, which is closest to C. Then find weighted average between D and C, save it as E and redirect all C edges and AB to that. Simplified figure - note, that E=(C+D)/2 is not completely accurate. I cannot add more than two links, but I have additional 2 images illustrating my problem.
Biggest problem was finding coordinates of D. I found possible solution on Mathematica site, but it contains ± sign, because when finding coordinate there are two possible coordinates. But I have line, where point is located on. Anyway, I don't know how to implement it correctly and my code has become quite messy:
# 2-nd run: Merge edges under small angles
for i in set(punktid.keys()):
try:
naabrid1=frozenset(punktid[i].edges) # naabrid / neighbours
for e in naabrid1:
t=set(naabrid1)
t.remove(e)
for u in t:
try:
a=nurk_3(punktid[i], punktid[e], punktid[u]) #Returns angle EIU in degrees. 0<=a<=180
if a<10:
de=((punktid[i].lat-punktid[e].lat)**2+
((punktid[i].lng-punktid[u].lng))*2 **2) #distance i-e
du=((punktid[i].lat-punktid[u].lat)**2+
((punktid[i].lng-punktid[u].lng)*2) **2) #distance i-u
b=radians(a)
if du<de:
lyhem=[u,du,e] # lühem in English is shorter
else: # but currently it should be lähem/closer
lyhem=[e,de,u]
if sin(b)*lyhem[1]<r:
lr=abs(sin(b)*lyhem[1])
ml=tan(nurk_coor(punktid[i],punktid[lyhem[0]])) #Lühema tõus / Slope of closer (C)
mp=tan(nurk_coor(punktid[i],punktid[lyhem[2]])) #Pikema / ...farer / B
mr=-1/ml #Ristsirge / ...BD
p1=(punktid[i].lng+lyhem[1]*(1/(1+ml**2)**0.5), punktid[i].lat+lyhem[1]*(ml/(1+ml**2)**0.5))
p2=(punktid[i].lng-lyhem[1]*(1/(1+ml**2)**0.5), punktid[i].lat-lyhem[1]*(ml/(1+ml**2)**0.5))
d1=((punktid[lyhem[0]].lat-p1[1])**2+
((punktid[lyhem[0]].lng-p1[0])*2)**2)**0.5 #distance i-e
d2=((punktid[lyhem[0]].lat-p2[1])**2+
((punktid[lyhem[0]].lng-p2[0])*2)**2)**0.5 #distance i-u
if d1<d2: # I experimented with one idea,
x=p1[0]#but it made things worse.
y=p1[1]#Originally I simply used p1 coordinates
else:
x=p2[0]
y=p2[1]
lo=punktid[lyhem[2]].weight*p2[0] # Finding weighted average
la=punktid[lyhem[2]].weight*p2[1]
la+=punktid[lyhem[0]].weight*punktid[lyhem[0]].lat
lo+=punktid[lyhem[0]].weight*punktid[lyhem[0]].lng
kaal=punktid[lyhem[2]].weight+punktid[lyhem[0]].weight #kaal = weight
c=(la/kaal,lo/kaal)
punktid[ui]=punkt(c[1],c[0], ui,punktid[lyhem[0]].edges, punktid[lyhem[0]].att,kaal)
punktid[i].edges.remove(lyhem[2])
punktid[lyhem[2]].edges.remove(i)
try:
for n in punktid[ui].edges: #In all neighbours
try: #Remove link to old point
punktid[n].edges.remove(lyhem[0])
except KeyError:
pass #If it doesn't link to current
punktid[n].edges.add(ui) #And add new point
if log:
printf(punktid[n].edges,'naabri '+str(n)+' edges')
except KeyError: #If neighbour itself has been removed
pass #(in same merge), Ignore
punktid[ui].edges.add(lyhem[2])
punktid[lyhem[2]].edges.add(ui)
punktid.pop(lyhem[0])
ui+=1
except KeyError: # u has been removed
pass
except KeyError: # i has been removed
pass
This is a code segment and it is likely to not run after copy-pasting because of missing variables/functions. New point is being calculated on lines 22 to 43, in 3rd if-statement from beginning if sin(b)*lyhem[1]<r to punktid[ui]=... After that is redirecting old edges to new node.
Stating question clearly: How to find point on ray (AB), if two coordinates of line segment (AC) and angles at these points are known (angle ACB should be 90 degrees)? How to implement it in Python 3.5?
PS. (Meta) If somebody needs full source, how could I provide it (uploading single text file without registration)? Pastebin or pasting (spamming) it here? If I upload it to other site, how to provide link, if newbie users are limited to two?

Python: Randomly draw several objects in a list

I am looking for the most efficient way to randomly draw nelements in a list given a list of probabilities stating the probability of each element to be picked.
aList = [3,4,2,1,4,3,5,7,6,4]
MyProba = [0.1,0.1,0.2,0,0.1,0,0.2,0,0.2,0.1]
It means that at each draw, the first element (which is 3) has a probability of 0.1 to be drawn. Of course,
sum(MyProba) == 1 # always returns True
len(aList) == len(MyProba) # always returns True
Up to now I did the following:
def random_pick(some_list, proba):
x = random.uniform(0, 1)
cumulative_proba = 0.0
for item, item_proba in zip(some_list, proba):
cumulative_proba += item_proba
if x < cumulative_proba:
break
return item
nb_draws = 10
list_of_drawn_elements = []
for one_draw in range(nb_draws):
list_of_drawn_elements.append(random_pick(aList, MyProba))
It works but it is terribly slow for long lists and big values of nb_draws. How can I improve the speed of this process?
Note: In the special case I am facing, nb_draws always equals the length of aList.
The general idea (as outlined by others' answers as well) is that your method is inefficient because the preprocessing (the calculation of the cumulative distribution) is done every time you draw a sample, although it would be enough to do it once before the sampling and then use the preprocessed data to do the sampling.
The preprocessing and sampling could be done efficiently with Walker's alias method. I have implemented it a while ago; take a look at the source code. (Sorry for the external link, but I think it's too long to post it here). My version requires NumPy; if you don't want to use NumPy, there is a NumPy-free alternative as well (on which my version is based).
Edit: the explanation of Walker's alias method is to be found in the first link I provided. In a nutshell, imagine that you somehow managed to construct a rectangular "darts board" that is subdivided into parts such that each part corresponds to one of your original items, and the area of each part is proportional to the desired probability of selecting the corresponding element. You can then start throwing darts at random at the darts board (by generating two random numbers that specify the horizontal and vertical coordinate of where the dart ended up) and check which areas the darts hit. The items corresponding to the areas will be the items you have selected. Walker's alias method is simply a linear-time preprocessing that constructs the dart board. Drawing each element can then be done in constant time. In the end, drawing m elements out of n will have a cost of O(n) for preprocessing and O(m) for generating the samples, yielding a total complexity of O(n + m).
here's my lazy method... build a list with expected number of values for the desired distribution, and use random.choice() to pick a value from the list.
>>> import random
>>>
>>> value_probs = dict(zip([3,4,2,1,4,3,5,7,6,4], [0.1,0.1,0.2,0,0.1,0,0.2,0,0.2,0.1]))
>>> expected_dist = sum([[i] * int(prob * 100) for i, prob in value_probs.iteritems()], [])
>>> random.choice(expected_dist)
You might try to precalculate the cumulative probability range for each element and make a tree from these intervals. Then you will get a logarithmic complexity for looking up the element corresponding to the generated probability, instead of linear one that you have now.
You're calculating cumulative_proba each time when you call random_pick. I suggest to calculate it outside the method, and use a better data structure to store it, like a binary search tree, which will reduce the time complexity from O(n) to O(lgn).

Pebbling a Checkerboard with Dynamic Programming

I am trying to teach myself Dynamic Programming, and ran into this problem from MIT.
We are given a checkerboard which has 4 rows and n columns, and
has an integer written in each square. We are also given a set of 2n pebbles, and we want to
place some or all of these on the checkerboard (each pebble can be placed on exactly one square)
so as to maximize the sum of the integers in the squares that are covered by pebbles. There is
one constraint: for a placement of pebbles to be legal, no two of them can be on horizontally or
vertically adjacent squares (diagonal adjacency is ok).
(a) Determine the number of legal patterns that can occur in any column (in isolation, ignoring
the pebbles in adjacent columns) and describe these patterns.
Call two patterns compatible if they can be placed on adjacent columns to form a legal placement.
Let us consider subproblems consisting of the rst k columns 1 k n. Each subproblem can
be assigned a type, which is the pattern occurring in the last column.
(b) Using the notions of compatibility and type, give an O(n)-time dynamic programming algorithm for computing an optimal placement.
Ok, so for part a: There are 8 possible solutions.
For part b, I'm unsure, but this is where I'm headed:
SPlit into sub-problems. Assume i in n.
1. Define Cj[i] to be the optimal value by pebbling columns 0,...,i, such that column i has pattern type j.
2. Create 8 separate arrays of n elements for each pattern type.
I am not sure where to go from here. I realize there are solutions to this problem online, but the solutions don't seem very clear to me.
You're on the right track. As you examine each new column, you will end up computing all possible best-scores up to that point.
Let's say you built your compatibility list (a 2D array) and called it Li[y] such that for each pattern i there are one or more compatible patterns Li[y].
Now, you examine column j. First, you compute that column's isolated scores for each pattern i. Call it Sj[i]. For each pattern i and compatible
pattern x = Li[y], you need to maximize the total score Cj such that Cj[x] = Cj-1[i] + Sj[x]. This is a simple array test and update (if bigger).
In addition, you store the pebbling pattern that led to each score. When you update Cj[x] (ie you increase its score from its present value) then remember the initial and subsequent patterns that caused the update as Pj[x] = i. That says "pattern x gave the best result, given the preceding pattern i".
When you are all done, just find the pattern i with the best score Cn[i]. You can then backtrack using Pj to recover the pebbling pattern from each column that led to this result.

Hash value for directed acyclic graph

How do I transform a directed acyclic graph into a hash value such that any two isomorphic graphs hash to the same value? It is acceptable, but undesirable for two isomorphic graphs to hash to different values, which is what I have done in the code below. We can assume that the number of vertices in the graph is at most 11.
I am particularly interested in Python code.
Here is what I did. If self.lt is a mapping from node to descendants (not children!), then I relabel the nodes according to a modified topological sort (that prefers to order elements with more descendants first if it can). Then, I hash the sorted dictionary. Some isomorphic graphs will hash to different values, especially as the number of nodes grows.
I have included all the code to motivate my use case. I am calculating the number of comparisons required to find the median of 7 numbers. The more that isomorphic graphs hash to the same value the less work that has to be redone. I considered putting larger connected components first, but didn't see how to do that quickly.
from tools.decorator import memoized # A standard memoization decorator
class Graph:
def __init__(self, n):
self.lt = {i: set() for i in range(n)}
def compared(self, i, j):
return j in self.lt[i] or i in self.lt[j]
def withedge(self, i, j):
retval = Graph(len(self.lt))
implied_lt = self.lt[j] | set([j])
for (s, lt_s), (k, lt_k) in zip(self.lt.items(),
retval.lt.items()):
lt_k |= lt_s
if i in lt_k or k == i:
lt_k |= implied_lt
return retval.toposort()
def toposort(self):
mapping = {}
while len(mapping) < len(self.lt):
for i, lt_i in self.lt.items():
if i in mapping:
continue
if any(i in lt_j or len(lt_i) < len(lt_j)
for j, lt_j in self.lt.items()
if j not in mapping):
continue
mapping[i] = len(mapping)
retval = Graph(0)
for i, lt_i in self.lt.items():
retval.lt[mapping[i]] = {mapping[j]
for j in lt_i}
return retval
def median_known(self):
n = len(self.lt)
for i, lt_i in self.lt.items():
if len(lt_i) != n // 2:
continue
if sum(1
for j, lt_j in self.lt.items()
if i in lt_j) == n // 2:
return True
return False
def __repr__(self):
return("[{}]".format(", ".join("{}: {{{}}}".format(
i,
", ".join(str(x) for x in lt_i))
for i, lt_i in self.lt.items())))
def hashkey(self):
return tuple(sorted({k: tuple(sorted(v))
for k, v in self.lt.items()}.items()))
def __hash__(self):
return hash(self.hashkey())
def __eq__(self, other):
return self.hashkey() == other.hashkey()
#memoized
def mincomps(g):
print("Calculating:", g)
if g.median_known():
return 0
nodes = g.lt.keys()
return 1 + min(max(mincomps(g.withedge(i, j)),
mincomps(g.withedge(j, i)))
for i in nodes
for j in nodes
if j > i and not g.compared(i, j))
g = Graph(7)
print(mincomps(g))
To effectively test for graph isomorphism you will want to use nauty. Specifically for Python there is the wrapper pynauty, but I can't attest its quality (to compile it correctly I had to do some simple patching on its setup.py). If this wrapper is doing everything correctly, then it simplifies nauty a lot for the uses you are interested and it is only a matter of hashing pynauty.certificate(somegraph) -- which will be the same value for isomorphic graphs.
Some quick tests showed that pynauty is giving the same certificate for every graph (with same amount of vertices). But that is only because of a minor issue in the wrapper when converting the graph to nauty's format. After fixing this, it works for me (I also used the graphs at http://funkybee.narod.ru/graphs.htm for comparison). Here is the short patch which also considers the modifications needed in setup.py:
diff -ur pynauty-0.5-orig/setup.py pynauty-0.5/setup.py
--- pynauty-0.5-orig/setup.py 2011-06-18 20:53:17.000000000 -0300
+++ pynauty-0.5/setup.py 2013-01-28 22:09:07.000000000 -0200
## -31,7 +31,9 ##
ext_pynauty = Extension(
name = MODULE + '._pynauty',
- sources = [ pynauty_dir + '/' + 'pynauty.c', ],
+ sources = [ pynauty_dir + '/' + 'pynauty.c',
+ os.path.join(nauty_dir, 'schreier.c'),
+ os.path.join(nauty_dir, 'naurng.c')],
depends = [ pynauty_dir + '/' + 'pynauty.h', ],
extra_compile_args = [ '-O4' ],
extra_objects = [ nauty_dir + '/' + 'nauty.o',
diff -ur pynauty-0.5-orig/src/pynauty.c pynauty-0.5/src/pynauty.c
--- pynauty-0.5-orig/src/pynauty.c 2011-03-03 23:34:15.000000000 -0300
+++ pynauty-0.5/src/pynauty.c 2013-01-29 00:38:36.000000000 -0200
## -320,7 +320,7 ##
PyObject *adjlist;
PyObject *p;
- int i,j;
+ Py_ssize_t i, j;
int adjlist_length;
int x, y;
Graph isomorphism for directed acyclic graphs is still GI-complete. Therefore there is currently no known (worst case sub-exponential) solution to guarantee that two isomorphic directed acyclic graphs will yield the same hash. Only if the mapping between different graphs is known - for example if all vertices have unique labels - one could efficiently guarantee matching hashes.
Okay, let's brute force this for a small number of vertices. We have to find a representation of the graph that is independent of the ordering of the vertices in the input and therefore guarantees that isomorphic graphs yield the same representation. Further this representation must ensure that no two non-isomorphic graphs yield the same representation.
The simplest solution is to construct the adjacency matrix for all n! permutations of the vertices and just interpret the adjacency matrix as n2 bit integer. Then we can just pick the smallest or largest of this numbers as canonical representation. This number completely encodes the graph and therefore ensures that no two non-isomorphic graphs yield the same number - one could consider this function a perfect hash function. And because we choose the smallest or largest number encoding the graph under all possible permutations of the vertices we further ensure that isomorphic graphs yield the same representation.
How good or bad is this in the case of 11 vertices? Well, the representation will have 121 bits. We can reduce this by 11 bits because the diagonal representing loops will be all zeros in an acyclic graph and are left with 110 bits. This number could in theory be decreased further; not all 2110 remaining graphs are acyclic and for each graph there may be up to 11! - roughly 225 - isomorphic representations but in practice this might be quite hard to do. Does anybody know how to compute the number of distinct directed acyclic graphs with n vertices?
How long will it take to find this representation? Naively 11! or 39,916,800 iterations. This is not nothing and probably already impractical but I did not implement and test it. But we can probably speed this up a bit. If we interpret the adjacency matrix as integer by concatenating the rows from top to bottom left to right we want many ones (zeros) at the left of the first row to obtain a large (small) number. Therefore we pick as first vertex the one (or one of the vertices) with largest (smallest) degree (indegree or outdegree depending on the representation) and than vertices connected (not connected) to this vertex in subsequent positions to bring the ones (zeros) to the left.
There are likely more possibilities to prune the search space but I am not sure if there are enough to make this a practical solution. Maybe there are or maybe somebody else can at least build something upon this idea.
How good does the hash have to be? I assume that you do not want a full serialization of the graph. A hash rarely guarantees that there is no second (but different) element (graph) that evaluates to the same hash. If it is very important to you, that isomorphic graphs (in different representations) have the same hash, then only use values that are invariant under a change of representation. E.g.:
the total number of nodes
the total number of (directed) connections
the total number of nodes with (indegree, outdegree) = (i,j) for any tuple (i,j) up to (max(indegree), max(outdegree)) (or limited for tuples up to some fixed value (m,n))
All these informations can be gathered in O(#nodes) [assuming that the graph is stored properly]. Concatenate them and you have a hash. If you prefer you can use some well known hash algorithm like sha on these concatenated informations. Without additional hashing it is a continuous hash (it allows to find similar graphs), with additional hashing it is uniform and fixed in size if the chosen hash algorithm has these properties.
As it is, it is already good enough to register any added or removed connection. It might miss connections that were changed though (a -> c instead of a -> b).
This approach is modular and can be extended as far as you like. Any additional property that is being included will reduce the number of collisions but increase the effort necessary to get the hash value. Some more ideas:
same as above but with second order in- and outdegree. Ie. the number of nodes that can be reached by a node->child->child chain ( = second order outdegree) or respectively the number of nodes that lead to the given node in two steps.
or more general n-th order in- and outdegree (can be computed in O((average-number-of-connections) ^ (n-1) * #nodes) )
number of nodes with eccentricity = x (again for any x)
if the nodes store any information (other than their neighbours) use a xor of any kind of hash of all the node-contents. Due to the xor the specific order in which the nodes where added to the hash does not matter.
You requested "a unique hash value" and clearly I cannot offer you one. But I see the terms "hash" and "unique to every graph" as mutually exclusive (not entirely true of course) and decided to answer the "hash" part and not the "unique" part. A "unique hash" (perfect hash) basically needs to be a full serialization of the graph (because the amount of information stored in the hash has to reflect the total amount of information in the graph). If that is really what you want just define some unique order of nodes (eg. sorted by own outdegree, then indegree, then outdegree of children and so on until the order is unambiguous) and serialize the graph in any way (using the position in the formentioned ordering as index to the nodes).
Of course this is much more complex though.
Years ago, I created a simple and flexible algorithm for exactly this problem (finding duplicate structures in a database of chemical structures by hashing them).
I named it "Powerhash", and to create the algorithm it required two insights. The first is the power iteration graph algorithm, also used in PageRank. The second is the ability to replace power iteration's inside step function with anything that we want. I replaced it with a function that does the following on each step, and for each node:
Sort the hashes of the node's neighbors
Hash the concatenated sorted hashes
On the first step, a node's hash is affected by its direct neighbors. On the second step, a node's hash is affected by the neighborhood 2-hops away from it. On the Nth step a node's hash will be affected by the neighborhood N-hops around it. So you only need to continue running the Powerhash for N = graph_radius steps. In the end, the graph center node's hash will have been affected by the whole graph.
To produce the final hash, sort the final step's node hashes and concatenate them together. After that, you can compare the final hashes to find if two graphs are isomorphic. If you have labels, then add them in the internal hashes that you calculate for each node (and at each step).
For more on this you can look at my post here:
https://plus.google.com/114866592715069940152/posts/fmBFhjhQcZF
The algorithm above was implemented inside the "madIS" functional relational database. You can find the source code of the algorithm here:
https://github.com/madgik/madis/blob/master/src/functions/aggregate/graph.py
Imho, If the graph could be topologically sorted, the very straightforward solution exists.
For each vertex with index i, you could build an unique hash (for example, using the hashing technique for strings) of his (sorted) direct neighbours (p.e. if vertex 1 has direct neighbours {43, 23, 2,7,12,19,334} the hash functions should hash the array of {2,7,12,19,23,43,334})
For the whole DAG you could create a hash, as a hash of a string of hashes for each node: Hash(DAG) = Hash(vertex_1) U Hash(vertex_2) U ..... Hash(vertex_N);
I think the complexity of this procedure is around (N*N) in the worst case. If the graph could not be topologically sorted, the approach proposed is still applicable, but you need to order vertices in an unique way (and this is the hard part)
I will describe an algorithm to hash an arbitrary directed graph, not taking into account that the graph is acyclic. In fact even counting the acyclic graphs of a given order is a very complicated task and I believe here this will only make the hashing significantly more complicated and thus slower.
A unique representation of the graph can be given by the neighbourhood list. For each vertex create a list with all it's neighbours. Write all the lists one after the other appending the number of neighbours for each list to the front. Also keep the neighbours sorted in ascending order to make the representation unique for each graph. So for example assume you have the graph:
1->2, 1->5
2->1, 2->4
3->4
5->3
What I propose is that you transform this to ({2,2,5}, {2,1,4}, {1,4}, {0}, {1,3}), here the curly brackets being only to visualize the representation, not part of the python's syntax. So the list is in fact: (2,2,5, 2,1,4, 1,4, 0, 1,3).
Now to compute the unique hash, you need to order these representations somehow and assign a unique number to them. I suggest you do something like a lexicographical sort to do that. Lets assume you have two sequences (a1, b1_1, b_1_2,...b_1_a1,a2, b_2_1, b_2_2,...b_2_a2,...an, b_n_1, b_n_2,...b_n_an) and (c1, d1_1, d_1_2,...d_1_c1,c2, d_2_1, d_2_2,...d_2_c2,...cn, d_n_1, d_n_2,...d_n_cn), Here c and a are the number of neighbours for each vertex and b_i_j and d_k_l are the corresponding neighbours. For the ordering first compare the sequnces (a1,a2,...an) and (c1,c2, ...,cn) and if they are different use this to compare the sequences. If these sequences are different, compare the lists from left to right first comparing lexicographically (b_1_1, b_1_2...b_1_a1) to (d_1_1, d_1_2...d_1_c1) and so on until the first missmatch.
In fact what I propose to use as hash the lexicographical number of a word of size N over the alphabet that is formed by all possible selections of subsets of elements of {1,2,3,...N}. The neighbourhood list for a given vertex is a letter over this alphabet e.g. {2,2,5} is the subset consisting of two elements of the set, namely 2 and 5.
The alphabet(set of possible letters) for the set {1,2,3} would be(ordered lexicographically):
{0}, {1,1}, {1,2}, {1,3}, {2, 1, 2}, {2, 1, 3}, {2, 2, 3}, {3, 1, 2, 3}
First number like above is the number of elements in the given subset and the remaining numbers- the subset itself. So form all the 3 letter words from this alphabet and you will get all the possible directed graphs with 3 vertices.
Now the number of subsets of the set {1,2,3,....N} is 2^N and thus the number of letters of this alphabet is 2^N. Now we code each directed graph of N nodes with a word with exactly N letters from this alphabet and thus the number of possible hash codes is precisely: (2^N)^N. This is to show that the hash code grows really fast with the increase of N. Also this is the number of possible different directed graphs with N nodes so what I suggest is optimal hashing in the sense it is bijection and no smaller hash can be unique.
There is a linear algorithm to get a given subset number in the the lexicographical ordering of all subsets of a given set, in this case {1,2,....N}. Here is the code I have written for coding/decoding a subset in number and vice versa. It is written in C++ but quite easy to understand I hope. For the hashing you will need only the code function but as the hash I propose is reversable I add the decode function - you will be able to reconstruct the graph from the hash which is quite cool I think:
typedef long long ll;
// Returns the number in the lexicographical order of all combinations of n numbers
// of the provided combination.
ll code(vector<int> a,int n)
{
sort(a.begin(),a.end()); // not needed if the set you pass is already sorted.
int cur = 0;
int m = a.size();
ll res =0;
for(int i=0;i<a.size();i++)
{
if(a[i] == cur+1)
{
res++;
cur = a[i];
continue;
}
else
{
res++;
int number_of_greater_nums = n - a[i];
for(int j = a[i]-1,increment=1;j>cur;j--,increment++)
res += 1LL << (number_of_greater_nums+increment);
cur = a[i];
}
}
return res;
}
// Takes the lexicographical code of a combination of n numbers and returns the
// combination
vector<int> decode(ll kod, int n)
{
vector<int> res;
int cur = 0;
int left = n; // Out of how many numbers are we left to choose.
while(kod)
{
ll all = 1LL << left;// how many are the total combinations
for(int i=n;i>=0;i--)
{
if(all - (1LL << (n-i+1)) +1 <= kod)
{
res.push_back(i);
left = n-i;
kod -= all - (1LL << (n-i+1)) +1;
break;
}
}
}
return res;
}
Also this code stores the result in long long variable, which is only enough for graphs with less than 64 elements. All possible hashes of graphs with 64 nodes will be (2^64)^64. This number has about 1280 digits so maybe is a big number. Still the algorithm I describe will work really fast and I believe you should be able to hash and 'unhash' graphs with a lot of vertices.
Also have a look at this question.
I'm not sure that it's 100% working, but here is an idea:
Let's code a graph into a string and then take its hash.
hash of an empty graph is ""
hash of a vertex with no outgoing edges is "."
hash of a vertex with outgoing edges is concatenation of every child hash with some delimiter (e.g. ",")
To produce the same hash for isomorphic graphs before concatenation in step3 just sort the hashes (e.g. in lexicographical order).
For hash of a graph just take hash of its root (or sorted concatenation, if there are several roots).
edit While I hoped that the resulting string will describe graph without collisions, hynekcer found that sometimes non-isomorphic graphs will get the same hash. That happens when a vertex has several parents - then it "duplicated" for every parent. For example, the algorithm does not differentiate a "diamond" {A->B->C,A->D->C} from the case {A->B->C,A->D->E}.
I'm not familiar with Python and it's hard for me to understand how graph stored in the example, but here is some code in C++ which is likely convertible to Python easily:
THash GetHash(const TGraph &graph)
{
return ComputeHash(GetVertexStringCode(graph,FindRoot(graph)));
}
std::string GetVertexStringCode(const TGraph &graph,TVertexIndex vertex)
{
std::vector<std::string> childHashes;
for(auto c:graph.GetChildren(vertex))
childHashes.push_back(GetVertexStringCode(graph,*c));
std::sort(childHashes.begin(),childHashes.end());
std::string result=".";
for(auto h:childHashes)
result+=*h+",";
return result;
}
I am assuming there are no common labels on vertices or edges, for then you could put the graph in a canonical form, which itself would be a perfect hash. This proposal is therefore based on isomorphism only.
For this, combine hashes for as many simple aggregate characteristics of a DAG as you can imagine, picking those that are quick to compute. Here is a starter list:
2d histogram of nodes' in and out degrees.
4d histogram of edges a->b where a and b are both characterized by in/out degree.
Addition
Let me be more explicit. For 1, we'd compute a set of triples <I,O;N> (where no two triples have the same I,O values), signifying that there are N nodes with in-degree I and out-degree O. You'd hash this set of triples or better yet use the whole set arranged in some canonical order e.g. lexicographically sorted. For 2, we compute a set of quintuples <aI,aO,bI,bO;N> signifying that there are N edges from nodes with in degree aI and out degree aO, to nodes with bI and bO respectively. Again hash these quintuples or else use them in canonical order as-is for another part of the final hash.
Starting with this and then looking at collisions that still occur will probably provide insights on how to get better.
When I saw the question, I had essentially the same idea as #example. I wrote a function providing a graph tag such that the tag coincides for two isomorphic graphs.
This tag consists of the sequence of out-degrees in ascending order. You can hash this tag with the string hash function of your choice to obtain a hash of the graph.
Edit: I expressed my proposal in the context of #NeilG's original question. The only modification to make to his code is to redefine the hashkey function as:
def hashkey(self):
return tuple(sorted(map(len,self.lt.values())))
With suitable ordering of your descendents (and if you have a single root node, not a given, but with suitable ordering (maybe by including a virtual root node)), the method for hashing a tree ought to work with a slight modification.
Example code in this StackOverflow answer, the modification would be to sort children in some deterministic order (increasing hash?) before hashing the parent.
Even if you have multiple possible roots, you can create a synthetic single root, with all roots as children.

Categories