Why is ''.join() significantly slower than string concatenation here?

Problem link: 2096. Step-By-Step Directions From a Binary Tree Node to Another.
I was solving a problem that asks us to output the shortest step-by-step path between a start node and a destination node in a binary tree, where traveling to the left or right child is denoted by 'L' or 'R', and traveling to a parent node is denoted by 'U'. We are given a reference to the root node, and the number of nodes n can be as large as 100,000.
My solution is as follows:
1. Run a BFS to find both the start and destination nodes, generating the paths along the way using arrays. Once we find both nodes, return the path arrays.
2. Now that we have paths to both the start and destination nodes, find the lowest common ancestor of the two nodes by throwing out any initial common letters (e.g. if startPath = ['L', 'R', 'L'] and destPath = ['L', 'L', 'R'], the lowest common ancestor is root.left, and the remaining startPath from this lowest common ancestor is ['R', 'L']).
3. Finally, to get the path between startNode and destNode, convert all remaining startPath letters to 'U's, then add on the remaining destPath (sketched below).
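For concreteness, the trimming in steps 2 and 3 might look like the following sketch (illustrative only; in this reading, the i and j of the later snippets are both just the length of the common prefix):

# s_path, t_path: root-to-node paths produced by the BFS (step 1).
i = 0
while i < len(s_path) and i < len(t_path) and s_path[i] == t_path[i]:
    i += 1  # i is now the depth of the lowest common ancestor
# Climb from the start node up to the LCA with 'U's, then descend to the target.
answer = "".join(['U'] * len(s_path[i:]) + t_path[i:])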
The relevant code pertaining to the question is as follows:
from collections import deque

def bfs_path2(root, target1, target2):
    q = deque([(root, [])])
    # found1/found2 flip to True and path1/path2 are captured when the
    # targets are encountered (that bookkeeping is elided in the question).
    while q and not (found1 and found2):
        node, path = q.popleft()
        if node.left:
            q.append((node.left, path + ['L']))
        if node.right:
            q.append((node.right, path + ['R']))
    return path1, path2

s_path, t_path = bfs_path2(root, startValue, destValue)
# i, j legal indices of s_path and t_path
return "".join(['U'] * len(s_path[i:]) + t_path[j:])
The runtime of this against a large test case was >10s. However, if I change the implementation of the BFS queue elements to strings, it runs in ~3s:
def bfs_path2(root, target1, target2):
    q = deque([(root, '')])
    # Same elided bookkeeping as above.
    while q and not (found1 and found2):
        node, path = q.popleft()
        if node.left:
            q.append((node.left, path + 'L'))
        if node.right:
            q.append((node.right, path + 'R'))
    return path1, path2

s_path, t_path = bfs_path2(root, startValue, destValue)
# i, j legal indices of s_path and t_path
return 'U' * len(s_path[i:]) + t_path[j:]
There are many links on StackOverflow showing that string concatenation is far slower than the join() method in Python, so I’m confused as to why the first code runtime is much slower than the second code runtime. Am I missing something here?
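To isolate the cost of growing the paths from the rest of the solution, one could micro-benchmark just the two accumulation strategies (a rough sketch, not the actual test case; it only measures growing a single path, and absolute numbers will vary by machine):

from timeit import timeit

def grow(n, empty, step):
    # Mimic the queue bookkeeping: every level copies the path and adds one move.
    path = empty
    for _ in range(n):
        path = path + step  # fresh copy each time, as in the q.append calls
    return path

n = 30_000
t_list = timeit(lambda: "".join(grow(n, [], ['L'])), number=1)
t_str = timeit(lambda: grow(n, '', 'L'), number=1)
print(f"lists + join: {t_list:.2f}s   string concat: {t_str:.2f}s")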

Related

Optimizing a path finding alg. from grid with rules

I'm trying to code a path-finding algorithm for this problem: you're given a grid with values, and your goal is to find the longest path from the highest point to the lowest point (the least steep one).
I experimented with code I found on here [https://stackoverflow.com/questions/68464767/find-a-path-from-grid-with-rules], but it struggles for large grids (e.g. 140x140). Since the code checks every single possible route, it takes a really long time. Could I implement any stopping condition that would abandon a path as soon as it can no longer be optimal?
As I said earlier, I found code on here that works, but it is just way too slow for me to use. Here's my code with the grid. The code works perfectly for smaller grids.
The code below isn't complete, because the grid is too large to be posted here. The full code can be found here - https://pastebin.com/q4nKS4rS
def find_paths_recursive(grid, current_path=[(136,136)], solutions=[]):
    n = len(grid)
    dirs = [(-1,0), (1,0), (0,1), (0,-1)]
    last_cell = current_path[-1]
    for x, y in dirs:
        new_i = last_cell[0] + x
        new_j = last_cell[1] + y
        # Check if new cell is in grid
        if new_i < 0 or new_i >= n or new_j < 0 or new_j >= n:
            continue
        # Check if new cell has bigger value than last
        if grid[new_i][new_j] > grid[last_cell[0]][last_cell[1]]:
            continue
        # Check if new cell is already in path
        if (new_i, new_j) in current_path:
            continue
        # Add cell to current path
        current_path_copy = current_path.copy()
        current_path_copy.append((new_i, new_j))
        if new_i == 0 and new_j == 0:
            solutions.append(current_path_copy)
            print(current_path_copy)
        # Create new current_path array for every direction
        find_paths_recursive(grid, current_path_copy, solutions)
    return solutions

def compute_cell_values(grid1, solutions):
    path_values = []
    for solution in solutions:
        solution_values = []
        for cell in solution:
            solution_values.append(grid1[cell[0]][cell[1]])
        path_values.append(solution_values)
    return path_values

grid1 = [...]
solutions = find_paths_recursive(grid1)
path_values = compute_cell_values(grid1, solutions)
print('Solutions:')
print(solutions)
print('Values:')
print(path_values)
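One common "stopping solution" is to prune via memoization rather than enumerate every route. Below is a minimal sketch under the assumption that steps must be strictly descending (so the move graph is acyclic and the visited-set check becomes unnecessary); this is not the asker's code, just an illustration of the idea:

from functools import lru_cache

def longest_descent(grid):
    # Longest strictly descending path length, memoized per cell: each cell
    # is solved once instead of once per route through it.
    n = len(grid)

    @lru_cache(maxsize=None)
    def dfs(i, j):
        best = 1
        for di, dj in ((-1, 0), (1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < n and 0 <= nj < n and grid[ni][nj] < grid[i][j]:
                best = max(best, 1 + dfs(ni, nj))
        return best

    return max(dfs(i, j) for i in range(n) for j in range(n))

With ties allowed (the question's code permits moves to equal values), cycles are possible and a single cell is no longer a valid cache key, which is exactly why the exhaustive version is so slow.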

How to find the optimal path for a graph with weighted edges using depth first search method?

I am trying to solve "Problem Set 2: Fastest Way to Get Around MIT" from MIT Course Number 6.0002:
In this problem set you will solve a simple optimization problem on a graph. Specifically, you will find the shortest route from one building to another on the MIT campus given that you wish to constrain the amount of time you spend walking outdoors (in the cold). [...]
Problem 3: Find the Shortest Path using Optimized Depth First Search
In our campus map problem, the total distance traveled on a path is equal to the sum of all total distances traveled between adjacent nodes on this path. Similarly, the distance spent outdoors on the path is equal to the sum of all distances spent outdoors on the edges in the path.
Depending on the number of nodes and edges in a graph, there can be multiple valid paths from one node to another, which may consist of varying distances. We define the shortest path between two nodes to be the path with the least total distance traveled. You are trying to minimize the distance traveled while not exceeding the maximum distance outdoors.
How do we find a path in the graph? Work off the depth-first traversal algorithm covered in lecture to discover each of the nodes and their children nodes to build up possible paths. Note that you’ll have to adapt the algorithm to fit this problem. [...]
Problem 3b: Implement get_best_path
Implement the helper function get_best_path. Assume that any variables you need have been set correctly in directed_dfs. Below is some pseudocode to help get you started.
if start and end are not valid nodes:
    raise an error
elif start and end are the same node:
    update the global variables appropriately
else:
    for all the child nodes of start
        construct a path including that node
        recursively solve the rest of the path, from the child node to the end node
    return the shortest path
I can't figure out what I am doing wrong in the algorithm to find the shortest path using the depth-first search method.
I tried unweighted edges and it works fine for that, but when I try weighted edges it does not return the shortest path.
def get_best_path(digraph, start, end, path, max_dist_outdoors, best_dist,
                  best_path):
    """
    Finds the shortest path between buildings subject to constraints.

    Parameters:
        digraph: Digraph instance
            The graph on which to carry out the search
        start: string
            Building number at which to start
        end: string
            Building number at which to end
        path: list composed of [[list of strings], int, int]
            Represents the current path of nodes being traversed. Contains
            a list of node names, total distance traveled, and total
            distance outdoors.
        max_dist_outdoors: int
            Maximum distance spent outdoors on a path
        best_dist: int
            The smallest distance between the original start and end node
            for the initial problem that you are trying to solve
        best_path: list of strings
            The shortest path found so far between the original start
            and end node.

    Returns:
        a list of building numbers (in strings), [n_1, n_2, ..., n_k],
        where there exists an edge from n_i to n_(i+1) in digraph,
        for all 1 <= i < k, and the distance of that path.
        If there exists no path that satisfies the max_total_dist and
        max_dist_outdoors constraints, then return None.
    """
    # TODO
    # put the first node in the path on each recursion
    path[0] = path[0] + [start]
    # if start and end nodes are same then return the path
    if start == end:
        return tuple(path[0])
    # create a node from the start point name
    start_node = Node(start)
    # for each edge starting at that start node, call the function recursively
    # if the destination node is not already in path
    # and if the best_dist has not been found yet or it is greater than the
    # total distance of the current path
    for an_edge in digraph.get_edges_for_node(start_node):
        # get the destination node for the edge
        a_node = an_edge.get_destination()
        # update the total distance traveled so far
        path[1] = path[1] + an_edge.get_total_distance()
        # update the distance spent outside
        path[2] = path[2] + an_edge.get_outdoor_distance()
        # if the node is not in path
        if str(a_node) not in path[0]:
            # if the best_distance is none or greater than distance of current path
            if path[1] < best_dist and path[2] < max_dist_outdoors:
                new_path = get_best_path(digraph, str(a_node), end,
                                         [path[0], path[1], path[2]],
                                         max_dist_outdoors, best_dist, best_path)
                if new_path != None:
                    best_dist = path[1]
                    print('best_dist', best_dist)
                    best_path = new_path
    return best_path
def get_best_path(digraph, start, end, path, max_dist_outdoors, best_dist, best_path):
    start_node = Node(start)
    end_node = Node(end)
    path[0] = path[0] + [start]
    if len(path[0]) > 1:
        dist, out_dist = get_distances_for_node(digraph, Node(path[0][-2]), start_node)
        path[1] = path[1] + dist
        path[2] = path[2] + out_dist
    if not digraph.has_node(start_node) and not digraph.has_node(end_node):
        raise ValueError('The graph does not contain these nodes')
    elif start_node == end_node:
        return (path[0], path[1])
    else:
        for an_edge in digraph.get_edges_for_node(start_node):
            next_node = an_edge.get_destination()
            if str(next_node) not in path[0]:
                expected_dist = path[1] + an_edge.get_total_distance()
                expected_out_dist = path[2] + an_edge.get_outdoor_distance()
                if expected_dist < best_dist and expected_out_dist <= max_dist_outdoors:
                    #print('best_path_check', path[1], best_dist)
                    new_path = get_best_path(digraph, str(next_node), end,
                                             [path[0], path[1], path[2]],
                                             max_dist_outdoors, best_dist, best_path)
                    #print(new_path)
                    if new_path[0] != None:
                        best_path = new_path[0]
                        best_dist = new_path[1]
        return (best_path, best_dist)

def get_distances_for_node(digraph, src, dest):
    for an_edge in digraph.get_edges_for_node(src):
        if an_edge.get_destination() == dest:
            return an_edge.get_total_distance(), an_edge.get_outdoor_distance()
I was able to solve my problem using this function, but I am not sure whether it is the best solution. Hope this helps.

Prohibitively slow execution of function compute_resilience in Python

The idea is to compute resilience of the network presented as an undirected graph in form
{node: (set of its neighbors) for each node in the graph}.
The function removes nodes from the graph in random order one by one and calculates the size of the largest remaining connected component.
The helper function bfs_visited() returns the set of nodes that are still connected to the given node.
How can I improve the implementation of the algorithm in Python 2, preferably without changing the breadth-first search in the helper function?
from collections import deque

def bfs_visited(graph, node):
    """undirected graph {Vertex: {neighbors}}
    Returns the set of all nodes visited by the algorithm"""
    queue = deque()
    queue.append(node)
    visited = set([node])
    while queue:
        current_node = queue.popleft()
        for neighbor in graph[current_node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return visited

def cc_visited(graph):
    """undirected graph {Vertex: {neighbors}}
    Returns a list of sets of connected components"""
    remaining_nodes = set(graph.keys())
    connected_components = []
    for node in remaining_nodes:
        visited = bfs_visited(graph, node)
        if visited not in connected_components:
            connected_components.append(visited)
            remaining_nodes = remaining_nodes - visited
            #print(node, remaining_nodes)
    return connected_components

def largest_cc_size(ugraph):
    """returns the size (an integer) of the largest connected component in
    the ugraph."""
    if not ugraph:
        return 0
    res = [(len(ccc), ccc) for ccc in cc_visited(ugraph)]
    res.sort()
    return res[-1][0]

def compute_resilience(ugraph, attack_order):
    """
    input: a graph {V: N}
    returns a list whose (k+1)-th entry is the size of the largest cc after
    the removal of the first k nodes
    """
    res = [len(ugraph)]
    for node in attack_order:
        neighbors = ugraph[node]
        for neighbor in neighbors:
            ugraph[neighbor].remove(node)
        ugraph.pop(node)
        res.append(largest_cc_size(ugraph))
    return res
I received this tremendously great answer from Gareth Rees, which covers the question completely.
Review
The docstring for bfs_visited should explain the node argument.
The docstring for compute_resilience should explain that the ugraph argument gets modified. Alternatively, the function could take a copy of the graph so that the original is not modified.
In bfs_visited the lines:
    queue = deque()
    queue.append(node)
can be simplified to:
    queue = deque([node])
The function largest_cc_size builds a list of pairs:
    res = [(len(ccc), ccc) for ccc in cc_visited(ugraph)]
    res.sort()
    return res[-1][0]
But you can see that it only ever uses the first element of each pair (the size of the component). So you could simplify it by not building the pairs:
    res = [len(ccc) for ccc in cc_visited(ugraph)]
    res.sort()
    return res[-1]
Since only the size of the largest component is needed, there is no need to build the whole list. Instead you could use max to find the largest:
    if ugraph:
        return max(map(len, cc_visited(ugraph)))
    else:
        return 0
If you are using Python 3.4 or later, this can be further simplified using the default argument to max:
    return max(map(len, cc_visited(ugraph)), default=0)
This is now so simple that it probably doesn't need its own function.
This line:
    remaining_nodes = set(graph.keys())
can be written more simply:
    remaining_nodes = set(graph)
There is a loop over the set remaining_nodes where on each loop iteration you update remaining_nodes:
    for node in remaining_nodes:
        visited = bfs_visited(graph, node)
        if visited not in connected_components:
            connected_components.append(visited)
            remaining_nodes = remaining_nodes - visited
It looks as if the intention of the code is to avoid iterating over the nodes in visited by removing them from remaining_nodes, but this doesn't work! The problem is that the for statement:
    for node in remaining_nodes:
only evaluates the expression remaining_nodes once, at the start of the loop. So when the code creates a new set and assigns it to remaining_nodes:
    remaining_nodes = remaining_nodes - visited
this has no effect on the nodes being iterated over.
You might imagine trying to fix this by using the difference_update method to adjust the set being iterated over:
    remaining_nodes.difference_update(visited)
but this would be a bad idea because then you would be iterating over a set and modifying it within the loop, which is not safe. Instead, you need to write the loop as follows:
    while remaining_nodes:
        node = remaining_nodes.pop()
        visited = bfs_visited(graph, node)
        if visited not in connected_components:
            connected_components.append(visited)
            remaining_nodes.difference_update(visited)
Using while and pop is the standard idiom in Python for consuming a data structure while modifying it — you do something similar in bfs_visited.
There is now no need for the test:
    if visited not in connected_components:
since each component is produced exactly once.
In compute_resilience the first line is:
    res = [len(ugraph)]
but this only works if the graph is a single connected component to start with. To handle the general case, the first line should be:
    res = [largest_cc_size(ugraph)]
For each node in attack order, compute_resilience calls:
    res.append(largest_cc_size(ugraph))
But this doesn't take advantage of the work that was previously done. When we remove node from the graph, all connected components remain the same, except for the connected component containing node. So we can potentially save some work if we only do a breadth-first search over that component, and not over the whole graph. (Whether this actually saves any work depends on how resilient the graph is. For highly resilient graphs it won't make much difference.)
In order to do this we'll need to redesign the data structures so that we can efficiently find the component containing a node, and efficiently remove that component from the collection of components.
This answer is already quite long, so I won't explain in detail how to redesign the data structures, I'll just present the revised code and let you figure it out for yourself.
def connected_components(graph, nodes):
    """Given an undirected graph represented as a mapping from nodes to
    the set of their neighbours, and a set of nodes, find the
    connected components in the graph containing those nodes.

    Returns:
    - mapping from nodes to the canonical node of the connected
      component they belong to
    - mapping from canonical nodes to connected components
    """
    canonical = {}
    components = {}
    while nodes:
        node = nodes.pop()
        component = bfs_visited(graph, node)
        components[node] = component
        nodes.difference_update(component)
        for n in component:
            canonical[n] = node
    return canonical, components

def resilience(graph, attack_order):
    """Given an undirected graph represented as a mapping from nodes to
    an iterable of their neighbours, and an iterable of nodes, generate
    integers such that the k-th result is the size of the largest
    connected component after the removal of the first k-1 nodes.
    """
    # Take a copy of the graph so that we can destructively modify it.
    graph = {node: set(neighbours) for node, neighbours in graph.items()}
    canonical, components = connected_components(graph, set(graph))
    largest = lambda: max(map(len, components.values()), default=0)
    yield largest()
    for node in attack_order:
        # Find connected component containing node.
        component = components.pop(canonical.pop(node))
        # Remove node from graph.
        for neighbor in graph[node]:
            graph[neighbor].remove(node)
        graph.pop(node)
        component.remove(node)
        # Component may have been split by removal of node, so search
        # it for new connected components and update data structures
        # accordingly.
        canon, comp = connected_components(graph, component)
        canonical.update(canon)
        components.update(comp)
        yield largest()
In the revised code, the max operation has to iterate over all the remaining connected components in order to find the largest one. It would be possible to improve the efficiency of this step by storing the connected components in a priority queue so that the largest one can be found in time that's logarithmic in the number of components.
I doubt that this part of the algorithm is a bottleneck in practice, so it's probably not worth the extra code, but if you need to do this, then there are some Priority Queue Implementation Notes in the Python documentation.
Performance comparison
Here's a useful function for making test cases:
from itertools import combinations
from random import random

def random_graph(n, p):
    """Return a random undirected graph with n nodes and each edge chosen
    independently with probability p.
    """
    assert 0 <= p <= 1
    graph = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if random() <= p:
            graph[i].add(j)
            graph[j].add(i)
    return graph
Now, a quick performance comparison between the revised and original code. Note that we have to run the revised code first, because the original code destructively modifies the graph, as noted in the review above.
>>> from timeit import timeit
>>> G = random_graph(300, 0.2)
>>> timeit(lambda:list(resilience(G, list(G))), number=1) # revised
0.28782312001567334
>>> timeit(lambda:compute_resilience(G, list(G)), number=1) # original
59.46968446299434
So the revised code is about 200 times faster on this test case.

Finding the (guaranteed unique) path between two nodes in a tree

I have a (likely) simple graph traversal question. I'm a graph newbie using networkx as my graph data structures. My graphs always look like this:
        0
    1       8
  2   3   9   10
 4 5 6 7 11 12 13 14
I need to return the path from the root node to a given node (eg., path(0, 11) should return [0, 8, 9, 11]).
I have a solution that works by passing along a list which grows and shrinks to keep track of what the path looks like as you traverse the tree, ultimately getting returned when the target node is found:
def VisitNode(self, node, target, path):
    path.append(node)
    # Base case. If we found the target, then notify the stack that we're done.
    if node == target:
        return True
    else:
        # If we're at a leaf and it isn't the target, then pop the leaf off
        # our path (backtrack) and notify the stack that we're still looking
        if len(self.neighbors(node)) == 0:
            path.pop()
            return False
        else:
            # Sniff down the next available neighboring node
            for i in self.neighbors_iter(node):
                # If this next node is the target, then return the path
                # we've constructed so far
                if self.VisitNode(i, target, path):
                    return path
            # If we've gotten this far without finding the target,
            # then this whole branch is a dud. Backtrack
            path.pop()
I feel in my bones that there is no need for passing around this "path" list... I should be able to keep track of that information using the call stack, but I can't figure out how... Could someone enlighten me on how you would solve this problem recursively using the stack to keep track of the path?
You could avoid passing around the path by returning None on failure, and a partial path on success. In this way, you do not keep some sort of 'breadcrumb trail' from the root to the current node, but you only construct a path from the target back to the root if you find it. Untested code:
def VisitNode(self, node, target):
    # Base case. If we found the target, return target in a list
    if node == target:
        return [node]
    # If we're at a leaf and it isn't the target, return None
    if len(self.neighbors(node)) == 0:
        return None
    # recursively iterate over children
    for i in self.neighbors_iter(node):
        tail = self.VisitNode(i, target)
        if tail:  # is not None
            return [node] + tail  # prepend node to path back from target
    return None  # none of the children contains target
I don't know the graph library you are using, but I assume that even leaves have a neighbors_iter method, which obviously shouldn't yield any children for a leaf. In this case, you can leave out the explicit check for a leaf, making it a bit shorter:
def VisitNode(self, node, target):
    # Base case. If we found the target, return target in a list
    if node == target:
        return [node]
    # recursively iterate over children
    for i in self.neighbors_iter(node):
        tail = self.VisitNode(i, target)
        if tail:  # is not None
            return [node] + tail  # prepend node to path back from target
    return None  # leaf node, or none of the children contains target
I also removed some of the else statements, since inside the true-part of the if you are returning from the function. This is a common refactoring pattern (which some old-school people don't like). It removes some unnecessary indentation.
You can avoid the path argument altogether by initializing path in the method's body. If the method returns before finding a full path, it may return an empty list.
But your question is also about using a stack instead of a list in a depth-first search implementation, right? You get a flavor of that here: http://en.literateprograms.org/Depth-first_search_%28Python%29.
In a nutshell, you
def depthFirstSearch(start, isGoal, result):
    ###ensure we're not stuck in a cycle
    result.append(start)
    ###check if we've found the goal
    ###expand each child node in order, returning if we find the goal
    # No path was found
    result.pop()
    return False
with
###<<expand each child node in order, returning if we find the goal>>=
for v in start.successors:
    if depthFirstSearch(v, isGoal, result):
        return True
and
###<<check if we've found the goal>>=
if isGoal(start):
    return True
Use networkx directly:
    all_simple_paths(G, source, target, cutoff=None)
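For instance (a small sketch; in networkx this lives at nx.all_simple_paths, and in a tree the generator yields exactly one path per source/target pair):

import networkx as nx

# The tree from the question, as an undirected graph.
G = nx.Graph()
G.add_edges_from([(0, 1), (0, 8), (1, 2), (1, 3), (8, 9), (8, 10),
                  (2, 4), (2, 5), (3, 6), (3, 7), (9, 11), (9, 12),
                  (10, 13), (10, 14)])

# A tree has exactly one simple path between any two nodes.
print(next(nx.all_simple_paths(G, source=0, target=11)))  # [0, 8, 9, 11]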

Enumerating all paths in a tree

I was wondering how to best implement a tree data structure to be able to enumerate paths of all levels. Let me explain it with the following example:
    A
   / \
  B   C
  |  / \
  D E   F
I want to be able to generate the following:
A
B
C
D
E
F
A-B
A-C
B-D
C-E
C-F
A-B-D
A-C-E
A-C-F
As of now, I am executing a depth-first-search for different depths on a data structure built using a dictionary and recording unique nodes that are seen but I was wondering if there is a better way to do this kind of a traversal. Any suggestions?
Whenever you find a problem on trees, just use recursion :D
def paths(tree):
    # Helper function
    # receives a tree and
    # returns all paths that have this node as root and all other paths
    if tree is None:  # empty tree
        return ([], [])
    else:  # tree is a node
        root = tree.value
        rooted_paths = [[root]]
        unrooted_paths = []
        for subtree in tree.children:
            (usable, unusable) = paths(subtree)
            for path in usable:
                unrooted_paths.append(path)
                rooted_paths.append([root] + path)
            for path in unusable:
                unrooted_paths.append(path)
        return (rooted_paths, unrooted_paths)

def the_function_you_use_in_the_end(tree):
    a, b = paths(tree)
    return a + b
Just one more way:
Every path without repetitions in a tree is uniquely described by its start and finish.
So one way to enumerate paths is to enumerate every possible pair of vertices. For each pair it's relatively easy to find the path (find the common ancestor and go through it).
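A sketch of that pairwise scheme, assuming a child-to-parent mapping is available (the names here are illustrative, and single-vertex paths would correspond to the pairs (v, v)):

from itertools import combinations

def path_between(parent, u, v):
    # Walk u up to the root, recording the order of its ancestors.
    up = [u]
    while up[-1] in parent:
        up.append(parent[up[-1]])
    index = {node: i for i, node in enumerate(up)}
    # Walk v up until we meet u's ancestor chain: that node is the LCA.
    down = [v]
    while down[-1] not in index:
        down.append(parent[down[-1]])
    lca = down[-1]
    # u up to the LCA, then back down (reversed) to v.
    return up[:index[lca] + 1] + down[-2::-1]

parent = {'B': 'A', 'C': 'A', 'D': 'B', 'E': 'C', 'F': 'C'}
for u, v in combinations('ABCDEF', 2):
    print(path_between(parent, u, v))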
Find a path to each node of the tree using depth-first search, then call enumerate-paths(p), where p is the path from the root to the node. Let's assume that a path p is an array of nodes p[0] p[1] .. p[n], where p[0] is the root and p[n] is the current node.
enumerate-paths(p) {
    for i = 0 .. n
        output p[n - i] .. p[n] as a path.
}
Each of these paths is different, and each of them is different from the results returned from any other node of the tree since no other paths end in p[n]. Clearly it is complete, since any path is from a node to some node between it and the root. It is also optimal, since it finds and outputs each path exactly once.
The order will be slightly different from yours, but you could always create an array of lists of paths, where A[x] is a list of the paths of length x. Then you could output the paths in order of their length, although this would take O(n) storage.
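The same idea in Python, under an assumed Node(value, children) shape (illustrative, matching the pseudocode above): at each node reached by the DFS, every suffix of the root path ending there is emitted exactly once.

from dataclasses import dataclass, field

@dataclass
class Node:
    value: str
    children: list = field(default_factory=list)

def all_paths(node, prefix=()):
    prefix = prefix + (node.value,)
    # Emit p[n-i] .. p[n] for i = 0 .. n: every suffix ending at this node.
    for i in range(len(prefix)):
        yield prefix[i:]
    for child in node.children:
        yield from all_paths(child, prefix)

tree = Node('A', [Node('B', [Node('D')]), Node('C', [Node('E'), Node('F')])])
for p in all_paths(tree):
    print('-'.join(p))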
