Finding the (guaranteed unique) path between two nodes in a tree - python

I have a (likely) simple graph traversal question. I'm a graph newbie using networkx for my graph data structures. My graphs always look like this:
0
1 8
2 3 9 10
4 5 6 7 11 12 13 14
I need to return the path from the root node to a given node (e.g., path(0, 11) should return [0, 8, 9, 11]).
I have a solution that works by passing along a list which grows and shrinks to keep track of what the path looks like as you traverse the tree, ultimately getting returned when the target node is found:
def VisitNode(self, node, target, path):
    path.append(node)
    # Base case. If we found the target, then notify the stack that we're done.
    if node == target:
        return True
    else:
        # If we're at a leaf and it isn't the target, then pop the leaf off
        # our path (backtrack) and notify the stack that we're still looking
        if len(self.neighbors(node)) == 0:
            path.pop()
            return False
        else:
            # Sniff down the next available neighboring node
            for i in self.neighbors_iter(node):
                # If this next node is the target, then return the path
                # we've constructed so far
                if self.VisitNode(i, target, path):
                    return path
            # If we've gotten this far without finding the target,
            # then this whole branch is a dud. Backtrack
            path.pop()
I feel in my bones that there is no need for passing around this "path" list... I should be able to keep track of that information using the call stack, but I can't figure out how... Could someone enlighten me on how you would solve this problem recursively using the stack to keep track of the path?

You could avoid passing around the path by returning None on failure, and a partial path on success. In this way, you do not keep some sort of 'breadcrumb trail' from the root to the current node, but you only construct a path from the target back to the root if you find it. Untested code:
def VisitNode(self, node, target):
    # Base case. If we found the target, return target in a list
    if node == target:
        return [node]
    # If we're at a leaf and it isn't the target, return None
    if len(self.neighbors(node)) == 0:
        return None
    # recursively iterate over children
    for i in self.neighbors_iter(node):
        tail = self.VisitNode(i, target)
        if tail:  # is not None
            return [node] + tail  # prepend node to path back from target
    return None  # none of the children contains target
I don't know the graph library you are using, but I assume that even leaf nodes have a neighbors_iter method, which simply shouldn't yield any children for a leaf. In that case, you can leave out the explicit check for a leaf, making it a bit shorter:
def VisitNode(self, node, target):
    # Base case. If we found the target, return target in a list
    if node == target:
        return [node]
    # recursively iterate over children
    for i in self.neighbors_iter(node):
        tail = self.VisitNode(i, target)
        if tail:  # is not None
            return [node] + tail  # prepend node to path back from target
    return None  # leaf node, or none of the children contains the target
I also removed some of the else statements, since the true branch of each if returns from the function. This is a common refactoring pattern (which some old-school people don't like), and it removes some unnecessary indentation.
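For reference, here is the same idea as a standalone, self-contained sketch (the adjacency dict and the function name are illustrative, not from the original post):

def find_path(children, node, target):
    # children maps each node to a list of its child nodes
    if node == target:
        return [node]
    for child in children.get(node, []):
        tail = find_path(children, child, target)
        if tail:
            return [node] + tail
    return None

tree = {0: [1, 8], 1: [2, 3], 8: [9, 10],
        2: [4, 5], 3: [6, 7], 9: [11, 12], 10: [13, 14]}
print(find_path(tree, 0, 11))  # [0, 8, 9, 11]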

You can avoid the path argument altogether by initializing path in the method's body. If the method returns before finding a full path, it may return an empty list.
But your question is also about using a stack instead of a list in Depth-First-search implementation, right? You get a flavor here: http://en.literateprograms.org/Depth-first_search_%28Python%29.
In a nutshell, you write:

def depthFirstSearch(start, isGoal, result):
    ###ensure we're not stuck in a cycle
    result.append(start)
    ###check if we've found the goal
    ###expand each child node in order, returning if we find the goal
    # No path was found
    result.pop()
    return False
with
###<<expand each child node in order, returning if we find the goal>>=
for v in start.successors:
    if depthFirstSearch(v, isGoal, result):
        return True
and
###<<check if we've found the goal>>=
if isGoal(start):
    return True
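Assembled into a single function, the sketch reads as follows (the body of the cycle-check chunk is my assumption of what the page elides here; see the link above for the authoritative version):

def depthFirstSearch(start, isGoal, result):
    # ensure we're not stuck in a cycle (assumed chunk body)
    if start in result:
        return False
    result.append(start)
    # check if we've found the goal
    if isGoal(start):
        return True
    # expand each child node in order, returning if we find the goal
    for v in start.successors:
        if depthFirstSearch(v, isGoal, result):
            return True
    # No path was found
    result.pop()
    return False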

Use networkx directly:
all_simple_paths(G, source, target, cutoff=None)
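For example (a small sketch; in a tree every simple path is unique, so the first generated path is the answer):

import networkx as nx

G = nx.Graph([(0, 1), (0, 8), (1, 2), (1, 3), (8, 9), (8, 10), (9, 11)])
path = next(nx.all_simple_paths(G, source=0, target=11))
print(path)  # [0, 8, 9, 11]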


Why is ''.join() significantly slower than string concatenation here?

Problem link: 2096. Step-By-Step Directions From a Binary Tree Node to Another.
I was solving a problem that asks us to output the shortest step-by-step path between a start node and a destination node in a binary tree, where traveling to left and right children are denoted by ‘L’ and ‘R’, and traveling to a parent node is denoted by ‘U’. We are given the reference to the root node, and the number of nodes n can be as large as 100,000.
My solution is as follows:
Run a BFS to find both start and destination nodes, while generating the paths along the way using arrays. Once we find both nodes, return the path arrays.
Now that we have paths to both start and destination nodes, find the lowest common ancestor of the two nodes by throwing out any initial common letters (e.g. if startPath = [‘L’, ‘R’, ‘L’] and destPath = [‘L’, ‘L’, ‘R’], the lowest common ancestor is root.left, and the remaining startPath from this lowest common ancestor is [‘R’, ‘L’].)
Finally, to get the path between startNode and destNode, we can convert all remaining startPath letters to ‘U’s, and then add on the remaining destPath.
The relevant code pertaining to the question is as follows:
def bfs_path2(root, target1, target2):
    q = deque([(root, [])])
    while q and not (found1 and found2):
        node, path = q.popleft()
        if node.left:
            q.append((node.left, path + ['L']))
        if node.right:
            q.append((node.right, path + ['R']))
    return path1, path2

s_path, t_path = bfs_path2(root, startValue, destValue)
# i, j legal indices of s_path and t_path
return "".join(['U']*len(s_path[i:]) + t_path[j:])
The runtime of this against a large test case was >10s. However, if I change the implementation of the BFS queue elements to strings, it runs in ~3s:
def bfs_path2(root, target1, target2):
    q = deque([(root, '')])
    while q and not (found1 and found2):
        node, path = q.popleft()
        if node.left:
            q.append((node.left, path + 'L'))
        if node.right:
            q.append((node.right, path + 'R'))
    return path1, path2

s_path, t_path = bfs_path2(root, startValue, destValue)
# i, j legal indices of s_path and t_path
return 'U'*len(s_path[i:]) + t_path[j:]
There are many links on StackOverflow showing that string concatenation is far slower than the join() method in Python, so I’m confused as to why the first code runtime is much slower than the second code runtime. Am I missing something here?

What is best: Global Variable or Parameter in this python function?

I have a question about the following code, but I guess it applies to other functions as well.
This function computes the maximum path and its length for a DAG, given the Graph, source node, and end node.
To keep track of already computed distances across recursions I use the "max_distances_and_paths" variable, and update it on each recursion.
Is it better to keep it as a function parameter (passed in and returned across recursions), or to use a global variable initialized outside the function?
Also, how can I avoid having this parameter returned when calling the function externally (it has to be passed along across recursions, but externally I don't care about its value)? Is there a better way than doing LongestPath(G, source, end)[0:2]?
Thanks
# For a DAG, computes the maximum distance and the maximum path node sequence (ordered in reverse).
# Recursively computes the paths and distances to edges which are adjacent to the end node
# and selects the maximum one.
# It will return a single maximum path (and its distance) even if there are different paths
# with the same max distance.
# Input: {Node 1: adj nodes directed to Node 1 ... Node N: adj nodes directed to Node N}
# Example: {'g': ['r'], 'k': ['g', 'r']}
def LongestPath(G, source, end, max_distances_and_paths=None):
    if max_distances_and_paths is None:
        max_distances_and_paths = {}
    max_path = [end]
    distances_list = []
    paths_list = []
    # return max_distance and max_path from source to the current "end" if already computed
    # (i.e. present in the dictionary tracking maximum distances and corresponding paths)
    if end in max_distances_and_paths:
        return max_distances_and_paths[end][0], max_distances_and_paths[end][1], max_distances_and_paths
    # base case, when the end node equals the source node
    if source == end:
        max_distance = 0
        return max_distance, max_path, max_distances_and_paths
    # if there are no adjacent nodes directed to the end node (and it is not the source node,
    # previous case), the path is disconnected
    if len(G[end]) == 0:
        return 0, [0], {"": []}
    # for each adjacent node pointing to the end node, recursively compute its max distance to
    # the source node and add one to get the distance to the end node; recursively collect the
    # nodes included in the path
    for t in G[end]:
        sub_distance, sub_path, max_distances_and_paths = LongestPath(G, source, t, max_distances_and_paths)
        paths_list += [[end] + sub_path]
        distances_list += [1 + sub_distance]
    # compute the max distance
    max_distance = max(distances_list)
    # use the index of max_distance in the distances list to retrieve the corresponding path
    index = [i for i, x in enumerate(distances_list) if x == max_distance][0]
    max_path = paths_list[index]
    # update the dictionary tracking maximum distances and corresponding paths from the
    # source node to the current end node
    max_distances_and_paths.update({end: [max_distance, max_path]})
    # return the computed max distance, corresponding path, and the tracker
    return max_distance, max_path, max_distances_and_paths
Global variables are generally avoided for several reasons (see Why are global variables evil?). I would recommend passing the parameter in this case. However, you could define a larger function housing your recursive function. Here's a quick example I wrote for a factorial:
def a(m):
    def b(m):
        if m < 1:
            return 1
        return m * b(m - 1)
    n = b(m)
    m = m + 2
    return n, m

print(a(6))
This will print (720, 8). This shows that even if you use the same variable name in your recursive function, the one you passed in to the enclosing function will not change. In your case, you want to just return n as per my example; I only returned an edited m value to show that even though both a and b take m as their input, Python keeps them separate.
In general I would say avoid the use of global variables. They make your code harder to read and often more difficult to debug once your codebase gets a bit more complex, so avoiding them is good practice.
I would use a helper function to initialise your recursion.
def longest_path_helper(G, source, end):
    max_distance, max_path, _ = LongestPath(G, source, end)
    return max_distance, max_path

This way callers never see the tracking dictionary, and there is no need for LongestPath(G, source, end)[0:2].
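Hypothetical usage, with the example graph format from the question:

G = {'g': ['r'], 'k': ['g', 'r']}
print(longest_path_helper(G, 'r', 'k'))  # (2, ['k', 'g', 'r']) -- path in reverse order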
On a side note, in Python the convention is to write function names in lowercase with words separated by underscores (snake_case), while CapitalizedWords names without underscores are used for classes. So it would be more Pythonic to use def longest_path():

Prohibitively slow execution of function compute_resilience in Python

The idea is to compute resilience of the network presented as an undirected graph in form
{node: (set of its neighbors) for each node in the graph}.
The function removes nodes from the graph in random order one by one and calculates the size of the largest remaining connected component.
The helper function bfs_visited() returns the set of nodes that are still connected to the given node.
How can I improve the implementation of the algorithm in Python 2, preferably without changing the breadth-first algorithm in the helper function?
def bfs_visited(graph, node):
    """undirected graph {Vertex: {neighbors}}
    Returns the set of all nodes visited by the algorithm"""
    queue = deque()
    queue.append(node)
    visited = set([node])
    while queue:
        current_node = queue.popleft()
        for neighbor in graph[current_node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return visited

def cc_visited(graph):
    """undirected graph {Vertex: {neighbors}}
    Returns a list of sets of connected components"""
    remaining_nodes = set(graph.keys())
    connected_components = []
    for node in remaining_nodes:
        visited = bfs_visited(graph, node)
        if visited not in connected_components:
            connected_components.append(visited)
        remaining_nodes = remaining_nodes - visited
        #print(node, remaining_nodes)
    return connected_components

def largest_cc_size(ugraph):
    """returns the size (an integer) of the largest connected component in
    the ugraph."""
    if not ugraph:
        return 0
    res = [(len(ccc), ccc) for ccc in cc_visited(ugraph)]
    res.sort()
    return res[-1][0]

def compute_resilience(ugraph, attack_order):
    """
    input: a graph {V: N}
    returns a list whose k+1th entry is the size of the largest cc after
    the removal of the first k nodes
    """
    res = [len(ugraph)]
    for node in attack_order:
        neighbors = ugraph[node]
        for neighbor in neighbors:
            ugraph[neighbor].remove(node)
        ugraph.pop(node)
        res.append(largest_cc_size(ugraph))
    return res
I received this tremendously great answer from Gareth Rees, which covers the question completely.
Review
The docstring for bfs_visited should explain the node argument.
The docstring for compute_resilience should explain that the ugraph argument gets modified. Alternatively, the function could take a copy of the graph so that the original is not modified.
In bfs_visited the lines:
queue = deque()
queue.append(node)
can be simplified to:
queue = deque([node])
The function largest_cc_size builds a list of pairs:
res = [(len(ccc), ccc) for ccc in cc_visited(ugraph)]
res.sort()
return res[-1][0]
But you can see that it only ever uses the first element of each pair (the size of the component). So you could simplify it by not building the pairs:
res = [len(ccc) for ccc in cc_visited(ugraph)]
res.sort()
return res[-1]
Since only the size of the largest component is needed, there is no need to build the whole list. Instead you could use max to find the largest:
if ugraph:
    return max(map(len, cc_visited(ugraph)))
else:
    return 0
If you are using Python 3.4 or later, this can be further simplified using the default argument to max:
return max(map(len, cc_visited(ugraph)), default=0)
This is now so simple that it probably doesn't need its own function.
This line:
remaining_nodes = set(graph.keys())
can be written more simply:
remaining_nodes = set(graph)
There is a loop over the set remaining_nodes where on each loop iteration you update remaining_nodes:
for node in remaining_nodes:
    visited = bfs_visited(graph, node)
    if visited not in connected_components:
        connected_components.append(visited)
    remaining_nodes = remaining_nodes - visited
It looks as if the intention of the code is to avoid iterating over the nodes in visited by removing them from remaining_nodes, but this doesn't work! The problem is that the for statement:
for node in remaining_nodes:
only evaluates the expression remaining_nodes once, at the start of the loop. So when the code creates a new set and assigns it to remaining_nodes:
remaining_nodes = remaining_nodes - visited
this has no effect on the nodes being iterated over.
You might imagine trying to fix this by using the difference_update method to adjust the set being iterated over:
remaining_nodes.difference_update(visited)
but this would be a bad idea because then you would be iterating over a set and modifying it within the loop, which is not safe. Instead, you need to write the loop as follows:
while remaining_nodes:
    node = remaining_nodes.pop()
    visited = bfs_visited(graph, node)
    if visited not in connected_components:
        connected_components.append(visited)
    remaining_nodes.difference_update(visited)
Using while and pop is the standard idiom in Python for consuming a data structure while modifying it — you do something similar in bfs_visited.
There is now no need for the test:
if visited not in connected_components:
since each component is produced exactly once.
In compute_resilience the first line is:
res = [len(ugraph)]
but this only works if the graph is a single connected component to start with. To handle the general case, the first line should be:
res = [largest_cc_size(ugraph)]
For each node in attack order, compute_resilience calls:
res.append(largest_cc_size(ugraph))
But this doesn't take advantage of the work that was previously done. When we remove node from the graph, all connected components remain the same, except for the connected component containing node. So we can potentially save some work if we only do a breadth-first search over that component, and not over the whole graph. (Whether this actually saves any work depends on how resilient the graph is. For highly resilient graphs it won't make much difference.)
In order to do this we'll need to redesign the data structures so that we can efficiently find the component containing a node, and efficiently remove that component from the collection of components.
This answer is already quite long, so I won't explain in detail how to redesign the data structures, I'll just present the revised code and let you figure it out for yourself.
def connected_components(graph, nodes):
    """Given an undirected graph represented as a mapping from nodes to
    the set of their neighbours, and a set of nodes, find the
    connected components in the graph containing those nodes.

    Returns:
    - mapping from nodes to the canonical node of the connected
      component they belong to
    - mapping from canonical nodes to connected components

    """
    canonical = {}
    components = {}
    while nodes:
        node = nodes.pop()
        component = bfs_visited(graph, node)
        components[node] = component
        nodes.difference_update(component)
        for n in component:
            canonical[n] = node
    return canonical, components
def resilience(graph, attack_order):
    """Given an undirected graph represented as a mapping from nodes to
    an iterable of their neighbours, and an iterable of nodes, generate
    integers such that the k-th result is the size of the largest
    connected component after the removal of the first k-1 nodes.

    """
    # Take a copy of the graph so that we can destructively modify it.
    graph = {node: set(neighbours) for node, neighbours in graph.items()}

    canonical, components = connected_components(graph, set(graph))
    largest = lambda: max(map(len, components.values()), default=0)
    yield largest()
    for node in attack_order:
        # Find connected component containing node.
        component = components.pop(canonical.pop(node))

        # Remove node from graph.
        for neighbor in graph[node]:
            graph[neighbor].remove(node)
        graph.pop(node)
        component.remove(node)

        # Component may have been split by removal of node, so search
        # it for new connected components and update data structures
        # accordingly.
        canon, comp = connected_components(graph, component)
        canonical.update(canon)
        components.update(comp)
        yield largest()
In the revised code, the max operation has to iterate over all the remaining connected components in order to find the largest one. It would be possible to improve the efficiency of this step by storing the connected components in a priority queue so that the largest one can be found in time that's logarithmic in the number of components.
I doubt that this part of the algorithm is a bottleneck in practice, so it's probably not worth the extra code, but if you need to do this, then there are some Priority Queue Implementation Notes in the Python documentation.
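A minimal sketch of the lazy-deletion idiom described in those notes, applied to tracking the largest component size (all names here are illustrative):

import heapq

heap = []      # entries of (-size, canonical_node); negation makes it a max-heap
current = {}   # canonical_node -> current size (the source of truth)

def push_component(node, size):
    current[node] = size
    heapq.heappush(heap, (-size, node))

def largest_size():
    while heap:
        neg_size, node = heap[0]
        if current.get(node) == -neg_size:
            return -neg_size      # entry is up to date
        heapq.heappop(heap)       # stale entry: discard and keep looking
    return 0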
Performance comparison
Here's a useful function for making test cases:
from itertools import combinations
from random import random

def random_graph(n, p):
    """Return a random undirected graph with n nodes and each edge chosen
    independently with probability p.

    """
    assert 0 <= p <= 1
    graph = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if random() <= p:
            graph[i].add(j)
            graph[j].add(i)
    return graph
Now, a quick performance comparison between the revised and original code. Note that we have to run the revised code first, because the original code destructively modifies the graph, as noted in the review above.
>>> from timeit import timeit
>>> G = random_graph(300, 0.2)
>>> timeit(lambda:list(resilience(G, list(G))), number=1) # revised
0.28782312001567334
>>> timeit(lambda:compute_resilience(G, list(G)), number=1) # original
59.46968446299434
So the revised code is about 200 times faster on this test case.

NetworkX find root_node for a particular node in a directed graph

Suppose I have a directed graph G in NetworkX such that:
G has multiple trees in it
Every node N in G has exactly one or zero parents.
For a particular node N1, I want to find the root node of the tree it resides in (its ancestor with in-degree 0). Is there an easy way to do this in networkx?
I looked at:
Getting the root (head) of a DiGraph in networkx (Python)
But there are multiple root nodes in my graph; I want only the one root node that happens to be in the same tree as N1.
Edit (Nov 2017): note that this was written before networkx 2.0 was released. There is a migration guide for updating 1.x code to 2.0 (and in particular for making it compatible with both).
Here's a simple recursive algorithm. It assumes there is at most a single parent. If something doesn't have a parent, it's the root. Otherwise, it returns the root of its parent.
def find_root(G, node):
    if G.predecessors(node):  # True if there is a predecessor, False otherwise
        root = find_root(G, G.predecessors(node)[0])
    else:
        root = node
    return root
If the graph is a directed acyclic graph, this will still find a root, though it might not be the only root, or even the only root ancestor of a given node.
I took the liberty of updating @Joel's script. His original post did not work for me.
def find_root(G, child):
    print(child)  # trace each node as we walk up (matches the test output below)
    parent = list(G.predecessors(child))
    if len(parent) == 0:
        print(f"found root: {child}")
        return child
    else:
        return find_root(G, parent[0])
Here's a test:
G = nx.DiGraph(data = [('glu', 'skin'), ('glu', 'bmi'), ('glu', 'bp'), ('glu', 'age'), ('npreg', 'glu')])
test = find_root(G, "age")
age
glu
npreg
found root: npreg
Networkx 2.5.1
The root/leaf nodes can be found using the edges:
for node_id in graph.nodes:
    if len(graph.in_edges(node_id)) == 0:
        print("root node")
    if len(graph.out_edges(node_id)) == 0:
        print("leaf node")
In case of multiple roots, we can do something like this:
def find_multiple_roots(G, nodes):
    list_roots = []
    for node in nodes:
        predecessors = list(G.predecessors(node))
        if len(predecessors) > 0:
            for predecessor in predecessors:
                list_roots.extend(find_root(G, [predecessor]))
        else:
            list_roots.append(node)
    return list_roots
Usage:
# the node needs to be passed as a list
find_multiple_roots(G, [node])
Warning: this recursive function can explode pretty quickly (the number of recursive calls can grow exponentially with the number of nodes between the current node and the roots), so use it with care.
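If the recursion depth is a concern, an iterative walk up the parent chain avoids it entirely (a sketch for the single-parent case, not from the original answers):

def find_root_iterative(G, node):
    # Follow the unique parent pointer until we reach a node with no predecessors.
    preds = list(G.predecessors(node))
    while preds:
        node = preds[0]
        preds = list(G.predecessors(node))
    return node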

Does networkx support dfs traversal by label

The networkx dfs_edges() function will iterate over child nodes. As far as I can tell, the http://networkx.lanl.gov/ documentation does not describe a parameter for dfs_edges() that restricts traversal to edges with a specific label.
Also, I looked at dfs_labeled_edges() but that only tells you the traversal direction while iterating over a graph with DFS.
There is no option to only traverse edges with a given label. If you don't mind making a copy of the graph, you can build a new graph containing only the edges with the specific label you want.
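For example, something along these lines (a sketch; it assumes the label is stored in an edge attribute named 'label', and G, wanted_label, and source are placeholders):

import networkx as nx

# Keep only the edges whose 'label' attribute matches the wanted value,
# then run the ordinary DFS on the filtered copy.
H = nx.Graph((u, v, d) for u, v, d in G.edges(data=True)
             if d.get('label') == wanted_label)
edges = list(nx.dfs_edges(H, source))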
If that doesn't work it wouldn't be that hard to modify the source code of dfs_edges() to do that. e.g.
if source is None:
    # produce edges for all components
    nodes = G
else:
    # produce edges for components with source
    nodes = [source]
visited = set()
for start in nodes:
    if start in visited:
        continue
    visited.add(start)
    stack = [(start, iter(G[start]))]                   # <- edit here
    while stack:
        parent, children = stack[-1]
        try:
            child = next(children)
            if child not in visited:
                yield parent, child
                visited.add(child)
                stack.append((child, iter(G[child])))   # <- and edit here
        except StopIteration:
            stack.pop()
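A sketch of what those two edits might look like, filtering by an edge attribute named 'label' (the attribute name and the helper function are assumptions, not part of networkx):

def labeled_neighbors(G, node, label):
    # Yield only the neighbors reachable over edges carrying the wanted label;
    # G[node][v] is the attribute dict of the edge (node, v).
    return iter(v for v in G[node] if G[node][v].get('label') == label)

# stack = [(start, labeled_neighbors(G, start, label))]        # <- edit here
# stack.append((child, labeled_neighbors(G, child, label)))    # <- and edit here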
I have an approach which is working for me. Thanks @Aric for the inspiration.
It is at https://github.com/namoopsoo/networkx/blob/master/networkx/algorithms/traversal/depth_first_search.py
It is a new function called dfs_edges_by_label(), which, given a label as input, only traverses edges matching that label.
