Computing many shortest paths in graph - python

I have a large (weighted, directed) graph (>100,000 nodes) and I want to compute a large number of random shortest paths in that graph. So I want to randomly select two nodes (let's say k times) and compute the shortest path. One way to do this is using either the networkx or the igraph module and doing a for loop as in
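import numpy as np

# graph is an igraph.Graph; with networkx, use len(graph.nodes) instead of graph.vcount()
pairs = np.random.choice(np.arange(0, graph.vcount()), [k, 2])
for pair in pairs:
    graph.get_shortest_paths(pair[0], pair[1], weights='weight')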
This works, but it takes a long time, especially compared to computing all paths from one particular source node. Essentially, in every iteration the process seems to load the graph again and start from scratch. Is there a way to benefit from having the graph structure loaded into memory once, instead of redoing that work in each iteration, without computing all shortest paths (which would take too long, given that there would be n*(n-1) of them)?
Phrased differently, can I compute a random subset of all shortest paths in an efficient way?

AFAIK, the operations are independent of each other, so running them in parallel could work (untested sketch):
import dask
import numpy as np

@dask.delayed
def short_path(graph, pair):
    return graph.get_shortest_paths(pair[0], pair[1], weights='weight')

pairs = np.random.choice(np.arange(0, graph.vcount()), [k, 2])
results = dask.compute(*[short_path(graph, pair) for pair in pairs])
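The question notes that computing all paths from a single source is comparatively fast. One way to exploit that, independent of parallelism, is to group the random pairs by source and run one search per distinct source, stopping as soon as all of that source's targets are settled. The sketch below is a plain-Python illustration of that idea, not what igraph or networkx do internally. The adjacency format (`adj` mapping a node to a list of `(neighbor, weight)` tuples) and both function names are assumptions made for the example, and weights are assumed to be non-negative.

```python
import heapq
from collections import defaultdict

def dijkstra_to_targets(adj, source, targets):
    """Dijkstra from `source`, stopping once every target is settled.

    `adj` (hypothetical format) maps node -> list of (neighbor, weight),
    with non-negative weights and mutually comparable node labels.
    Returns {target: list of nodes on a shortest path, or None if unreachable}.
    """
    remaining = set(targets)
    dist = {source: 0}
    parent = {source: None}
    settled = set()
    heap = [(0, source)]
    while heap and remaining:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue  # stale heap entry
        settled.add(u)
        remaining.discard(u)
        for v, w in adj.get(u, ()):
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    paths = {}
    for t in targets:
        if t not in settled:
            paths[t] = None
            continue
        path, node = [], t
        while node is not None:  # walk parent pointers back to the source
            path.append(node)
            node = parent[node]
        paths[t] = path[::-1]
    return paths

def shortest_paths_for_pairs(adj, pairs):
    """Answer many (source, target) queries with one search per distinct source."""
    by_source = defaultdict(set)
    for s, t in pairs:
        by_source[s].add(t)
    result = {}
    for s, targets in by_source.items():
        for t, path in dijkstra_to_targets(adj, s, targets).items():
            result[(s, t)] = path
    return result
```

This only saves work when sources repeat among the pairs. With k much smaller than the number of nodes, uniformly random pairs rarely share a source, so the gain would come from sampling a few sources and several targets per source, which changes the sampling scheme.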

Related

What's the fastest way of finding the longest chain of strings with common first/last characters?

This one seems pretty straightforward but I'm having trouble implementing it in Python.
Say you have an array of strings, all of length 2: ["ab","bc","cd","za","ef","fg"]
There are two chains here: "zabcd" and "efg"
What's the most efficient way of finding the longest one? (in this case it would be "zabcd").
Chains can be cyclical too... for instance ["ab","bc","ca"] (in this particular case the length of the chain would be 3).
This is clearly a graph problem, with the characters being the vertices and the pairs being unweighted directed edges.
Without allowing cycles in the solution, this is the longest path problem, and it is NP-hard, so "efficient" is probably out of the window, even if cycles are allowed (to get rid of the solution cycles, split the vertices in two, one for incoming edges and one for out-going edges, with an edge in-between). According to Wikipedia, no good approximation scheme is known.
If the graph is acyclic, then you can do it in linear time, as the Wikipedia article mentions:
A longest path between two given vertices s and t in a weighted graph G is the same thing as a shortest path in a graph −G derived from G by changing every weight to its negation. Therefore, if shortest paths can be found in −G, then longest paths can also be found in G.[4]
For most graphs, this transformation is not useful because it creates cycles of negative length in −G. But if G is a directed acyclic graph, then no negative cycles can be created, and a longest path in G can be found in linear time by applying a linear time algorithm for shortest paths in −G, which is also a directed acyclic graph.[4] For instance, for each vertex v in a given DAG, the length of the longest path ending at v may be obtained by the following steps:
1. Find a topological ordering of the given DAG.
2. For each vertex v of the DAG, in the topological ordering, compute the length of the longest path ending at v by looking at its incoming neighbors and adding one to the maximum length recorded for those neighbors. If v has no incoming neighbors, set the length of the longest path ending at v to zero. In either case, record this number so that later steps of the algorithm can access it.
Once this has been done, the longest path in the whole DAG may be obtained by starting at the vertex v with the largest recorded value, then repeatedly stepping backwards to its incoming neighbor with the largest recorded value, and reversing the sequence of vertices found in this way.
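The quoted steps can be sketched in a few lines for the acyclic case. This is a minimal illustration, assuming Python 3.9+ for `graphlib`; the function name `longest_path_dag` and the edge-list input format are choices made for the example. It does not handle the cyclic case, which is the NP-hard one.

```python
from graphlib import TopologicalSorter

def longest_path_dag(edges):
    """Return one longest path (list of vertices) in a DAG given as (u, v) edges.

    Raises graphlib.CycleError if the edges contain a cycle.
    """
    preds = {}
    for u, v in edges:
        preds.setdefault(u, set())
        preds.setdefault(v, set()).add(u)
    # TopologicalSorter expects node -> predecessors; predecessors come first.
    order = list(TopologicalSorter(preds).static_order())
    if not order:
        return []
    length = {}     # length of the longest path ending at each vertex
    best_pred = {}  # incoming neighbor that achieves that length
    for v in order:
        length[v] = 0
        best_pred[v] = None
        for u in preds[v]:
            if length[u] + 1 > length[v]:
                length[v] = length[u] + 1
                best_pred[v] = u
    # Start at the vertex with the largest value and step backwards.
    node = max(order, key=length.get)
    path = []
    while node is not None:
        path.append(node)
        node = best_pred[node]
    return path[::-1]

pairs = ["ab", "bc", "cd", "za", "ef", "fg"]
print("".join(longest_path_dag([tuple(p) for p in pairs])))  # zabcd
```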
There are other special cases where there are efficient algorithms available, notably trees, but since you allow cycles that probably doesn't apply to you.
I didn't give an algorithm here for you, but that should give you the right direction for your research. The problem itself may be straightforward, but an efficient solution is NOT.

(Python graph-tool) graph-tool search using OpenMP? Can finding all paths between a source and target vertex be made parallel?

I am currently using graph_tool.topology.all_paths to find all paths between two vertices of a specific length, but the documentation doesn't explicitly say what algorithm is used. I assume it is either the breadth-first search (BFS) or Dijkstra’s algorithm like shortest_path or all_shortest_paths?
My graph is unweighted and directed. Is there a way to run my all_paths searches in parallel and use more cores? I know I have OpenMP turned on (checked with openmp_enabled()) and that it is set to use all 8 cores that I have.
I have seen that some algorithms such as the DFS cannot be made parallel, but I don't understand why searching through my graph to find all paths up to a certain length is not being done using multiple cores, especially when the Graph-tool performance comparison page has benchmarks for shortest path using multiple cores.
Running graph_tool.topology.all_paths(g, source, target, cutoff=4) using a basic function:
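import graph_tool.topology

def find_paths_of_length(graph, path_length, start_vertex, end_vertex):
    savedpath = []
    # all_paths is a lazy iterator; collect every path up to the cutoff length
    for path in graph_tool.topology.all_paths(graph, start_vertex, end_vertex, cutoff=path_length):
        savedpath.append(path)
    return savedpath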
only uses 1 core. Is there any way that this can be done in parallel? My network contains on the order of 50 million vertices and 200 million edges, and the algorithm is O(V+E) according to the documentation.
Thanks in advance

Algorithm designing

Which one is more suitable for designing an algorithm that produces all the paths between two vertices in a directed graph?
Backtracking
Divide and conquer
Greedy approach
Dynamic programming
I was thinking of Backtracking due to the BFS and DFS, but I am not sure. Thank you.
Note that there can be an exponential number of paths in your output.
Indeed, in a directed graph of n vertices having an edge i -> j for every pair i < j, there are 2^(n-2) paths from 1 to n: each vertex except the endpoints can be either present in the path or omitted.
So, if we really want to output all paths (and not, e.g., build a clever lazy structure to list them one by one later), no advanced technique can help achieve polynomial complexity here.
The simplest way to find all the simple paths is recursively constructing a path, and adding the current path to the answer once we arrive at the end vertex.
To improve it, we can use backtracking with pruning.
For each vertex, we first compute whether the final vertex is reachable from it, which can be done in polynomial time.
Afterwards, we only extend paths through vertices for which the answer was positive.
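A minimal sketch of that idea follows. It first marks every vertex from which the end vertex is reachable (a search over reversed edges), then builds paths recursively and only steps onto marked vertices. The function name and the adjacency format (`adj` mapping a vertex to a list of successors) are assumptions for illustration. It is written as a generator so that paths are produced one at a time.

```python
def all_simple_paths(adj, source, target):
    """Yield every simple path from source to target as a list of vertices."""
    # Reverse the edges, then search backwards from the target to find
    # all vertices that can still reach it.
    rev = {}
    for u, nbrs in adj.items():
        for v in nbrs:
            rev.setdefault(v, []).append(u)
    can_reach = {target}
    stack = [target]
    while stack:
        v = stack.pop()
        for u in rev.get(v, ()):
            if u not in can_reach:
                can_reach.add(u)
                stack.append(u)

    path = [source]
    on_path = {source}

    def extend(u):
        if u == target:
            yield list(path)
            return
        for v in adj.get(u, ()):
            # Prune dead ends and vertices already on the current path.
            if v in can_reach and v not in on_path:
                path.append(v)
                on_path.add(v)
                yield from extend(v)
                path.pop()
                on_path.remove(v)

    if source in can_reach:
        yield from extend(source)

# Complete DAG on 5 vertices: edge i -> j for every i < j.
adj = {i: list(range(i + 1, 6)) for i in range(1, 6)}
print(len(list(all_simple_paths(adj, 1, 5))))  # 8, i.e. 2^(5-2)
```

The pruning does not change the exponential worst case; it only avoids exploring branches that can never reach the end vertex.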

Check if path exists that contains a set of N nodes

Given a graph g and a set of N nodes my_nodes = [n1, n2, n3, ...], how can I check if there's a path that contains all N nodes?
Checking among all_simple_paths for paths that contain all nodes in my_nodes becomes computationally cumbersome as the graph grows
The search above can be limited to paths between my_nodes pairwise couples. This reduces complexity only to a small degree. Plus it requires a lot of python looping, which is quite slow
Is there a faster solution to the problem?
You could try a greedy algorithm here: start the search from all of the nodes you need to find, and explore the graph step by step. I can't provide a real example, but the pseudocode would be something like this:
1. Start n path stubs, one from each of the n nodes to find.
2. Extend every path stub by all of its neighbors that haven't been checked before.
3. If two path stubs intersect, merge them into a new stub, which contains more of the needed nodes than before.
4. If, after merging, one stub covers all needed nodes, you're done.
5. If there are still nodes missing from the path, go back to step 2.
6. If there are no unvisited nodes left in the graph, the path doesn't exist.
This algorithm has complexity O(E + N), because you visit the edges and nodes in a non-recursive fashion.
However, for a directed graph the "merge" is more complicated. It can still be done, but the worst case may take a lot of time.
Update:
As you say that the graph is directed, the above approach wouldn't work well. In this case you can simplify the task like this:
1. Find the strongly connected components of the graph (I suggest implementing this yourself, e.g., with Kosaraju's algorithm). The complexity is O(E + N). You can use a NetworkX method for this if you want an out-of-the-box solution.
2. Create the condensation of the graph from the step 1 information, keeping track of which components can be reached from which others. Again, there is a NetworkX method for this.
3. Now you can easily tell which nodes from your set are in the same component; for those, a path containing all of them definitely exists.
4. After that, all you need to check is the connectivity between the different components that contain your nodes. For example, you can take a topological sort of the condensation and do the check in linear time again.
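The steps above could be sketched in plain Python as follows. This is an illustration under stated assumptions, not a reference implementation: the function names and the adjacency format (`adj` mapping every node to a list of successors) are made up for the example, and the check is for a walk that may revisit nodes, since inside a strongly connected component that is all the approach guarantees. The final reachability check here runs one search per pair of consecutive components rather than a single linear pass.

```python
def strongly_connected_components(adj):
    """Kosaraju's algorithm. Returns {node: component id}; ids are assigned
    in a topological order of the condensation (sources first)."""
    nodes = set(adj) | {v for nbrs in adj.values() for v in nbrs}
    rev = {v: [] for v in nodes}
    for u, nbrs in adj.items():
        for v in nbrs:
            rev[v].append(u)
    # Pass 1: iterative DFS on the graph, recording finish order.
    finished, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(adj.get(root, ())))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(adj.get(nxt, ()))))
                    break
            else:  # all successors explored
                finished.append(node)
                stack.pop()
    # Pass 2: DFS on the reversed graph in decreasing finish order.
    comp, cid = {}, 0
    for root in reversed(finished):
        if root in comp:
            continue
        comp[root] = cid
        stack = [root]
        while stack:
            node = stack.pop()
            for prev in rev[node]:
                if prev not in comp:
                    comp[prev] = cid
                    stack.append(prev)
        cid += 1
    return comp

def reachable(dag, a, b):
    """True if component b can be reached from component a in the condensation."""
    seen, stack = {a}, [a]
    while stack:
        c = stack.pop()
        if c == b:
            return True
        for d in dag.get(c, ()):
            if d not in seen:
                seen.add(d)
                stack.append(d)
    return False

def walk_through_all(adj, my_nodes):
    """True if some walk (nodes may repeat) visits every node in my_nodes."""
    comp = strongly_connected_components(adj)
    dag = {}  # condensation: edges between distinct components
    for u, nbrs in adj.items():
        for v in nbrs:
            if comp[u] != comp[v]:
                dag.setdefault(comp[u], set()).add(comp[v])
    # The needed components must form a chain in topological order.
    wanted = sorted({comp[n] for n in my_nodes})
    return all(reachable(dag, a, b) for a, b in zip(wanted, wanted[1:]))
```

Sorting the component ids works because Kosaraju's second pass discovers components in a topological order of the condensation, so any walk covering all needed components has to visit them in that order.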

Find cliques of length k in a graph

I'm working with graphs of ~200 nodes and ~3500 edges. I need to find all cliques of this graph. Using networkx's enumerate_all_cliques() works fine with smaller graphs of up to 100 nodes, but runs out of memory for bigger ones.
"This algorithm however, hopefully, does not run out of memory
since it only keeps candidate sublists in memory and
continuously removes exhausted sublists." (source code for enumerate_all_cliques())
Is there maybe a way to return a generator of all cliques of length k, instead of all cliques, in order to save memory?
It seems that your priority is to save memory rather than getting all cliques. In that case the use of networkx.find_cliques(G) is a satisfactory solution as you will get all maximal cliques (largest complete subgraph containing a given node) instead of all cliques.
I compared the number of lists (subgraphs) of both functions:
import networkx as nx

G = nx.erdos_renyi_graph(300, 0.08)
print('All:', len(list(nx.enumerate_all_cliques(G))))
print('Maximal', len(list(nx.find_cliques(G))))
All: 6087
Maximal 2522
And when the number of edges increases in the graph the difference in results gets wider.
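For the original request, a generator that yields only the cliques of size k, a depth-first sketch is one option. It keeps a single partial clique and its candidate set in memory at a time, rather than a queue of all smaller cliques. This is an illustrative plain-Python version, not a networkx function: the name `k_cliques` and the input format (`adj` mapping each node to the set of its neighbors, symmetric for an undirected graph) are assumptions.

```python
def k_cliques(adj, k):
    """Lazily yield every clique with exactly k nodes, each exactly once.

    `adj` (hypothetical format) maps node -> set of neighbors, undirected.
    """
    order = {v: i for i, v in enumerate(adj)}
    # Only extend towards later nodes so each clique is built in one order.
    later = {v: {u for u in adj[v] if order[u] > order[v]} for v in adj}
    clique = []

    def extend(candidates):
        if len(clique) == k:
            yield tuple(clique)
            return
        if len(clique) + len(candidates) < k:
            return  # not enough candidates left to reach size k
        for v in sorted(candidates, key=order.get):
            clique.append(v)
            # New candidates must be adjacent to every node chosen so far.
            yield from extend(candidates & later[v])
            clique.pop()

    yield from extend(set(adj))

# Complete graph on 5 nodes: C(5, 3) = 10 triangles.
adj = {i: set(range(5)) - {i} for i in range(5)}
print(sum(1 for _ in k_cliques(adj, 3)))  # 10
```

With a networkx graph G, the adjacency dictionary could be built as `{v: set(G[v]) for v in G}`. The number of k-cliques can still be very large, so the running time may remain the bottleneck even when memory is not.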
