Most optimal topological sort - python

I found a solution to this topological sorting question, but it's not a solution I came across in any of my research, which leads me to believe it's not optimal. The question (from AlgoExpert) reads along the lines of: "Return one of the possible graph traversals given a graph where each node represents a job and each edge represents that job's prereq. The first param is a list of numbers representing the jobs; the second param is a list of arrays (size 2) where the first number in the array is the prereq of the job given by the second number. For example, inputs ([1,2,3], [[1,3],[2,1],[2,3]]) => [2, 1, 3]. Note: the graph may not be acyclic, in which case the algorithm should return an empty array. For example, inputs ([1,2], [[1,2],[2,1]]) => []."
A popular optimal solution is a bit confusing to me; I've tried implementing it, but I keep getting situations where my implementation detects a cycle and short-circuits, returning an empty array. That algorithm "works backwards" in a depth-first manner, keeping "in-progress" and "visited" nodes in memory while searching the graph.
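For reference, here is roughly the shape of that approach as I understand it. This is a sketch in my own words rather than the reference solution, with names of my own choosing (graph here maps each job to the list of its prereqs):

def topological_sort_dfs(jobs, relations):
    # relations: list of [prereq, job] pairs, same format as in the question
    graph = {job: [] for job in jobs}
    for prereq, job in relations:
        graph[job].append(prereq)

    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {job: UNVISITED for job in jobs}
    order = []

    def visit(job):
        if state[job] == DONE:
            return True
        if state[job] == IN_PROGRESS:  # reached a node already on the current path: cycle
            return False
        state[job] = IN_PROGRESS
        for prereq in graph[job]:
            if not visit(prereq):
                return False
        state[job] = DONE
        order.append(job)  # every prereq was appended before this job
        return True

    for job in jobs:
        if not visit(job):
            return []
    return order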
My algorithm initially finds the graph nodes with no prereqs (these can be added to the return array immediately) and removes each such node from every other node's prereqs. During this removal, if a node ends up with 0 prereqs, it is added to the stack. When the stack is empty, the return array is returned if its size matches the size of the first param (the jobs list); otherwise an empty array is returned, which means a cycle was present in the graph. Here's the code for my algorithm:
def topologicalSort(jobs, relations):
    rtn = []
    jobsHash = {}
    stackSet = set()
    for job in jobs:
        stackSet.add(job)
    for relation in relations:
        if relation[1] in stackSet:
            stackSet.remove(relation[1])
        if relation[0] not in jobsHash:
            jobsHash[relation[0]] = {"prereqs": set(), "depends": set()}
        jobsHash[relation[0]]["depends"].add(relation[1])
        if relation[1] not in jobsHash:
            jobsHash[relation[1]] = {"prereqs": set(), "depends": set()}
        jobsHash[relation[1]]["prereqs"].add(relation[0])
        if jobsHash[relation[0]]["prereqs"].__contains__(relation[1]):  # 2-node cycle shortcut
            return []
    stack = []
    for job in stackSet:
        stack.append(job)
    while len(stack):
        job = stack.pop()
        rtn.append(job)
        clearDepends(jobsHash, job, stack)
    if len(rtn) == len(jobs):
        return rtn
    else:
        return []

def clearDepends(jobsHash, job, stack):
    if job in jobsHash:
        for dependJob in jobsHash[job]["depends"]:
            jobsHash[dependJob]["prereqs"].remove(job)
            if not len(jobsHash[dependJob]["prereqs"]):
                stack.append(dependJob)
        jobsHash[job]["depends"] = set()

print(topologicalSort([1,2,3,4],[[1,2],[1,3],[3,2],[4,2],[4,3]]))
I found this algorithm to have a time complexity of O(j + d) and a space complexity of O(j + d), where j is the number of jobs and d the number of dependencies, which is on par with the popular algorithm's characteristics. My question is: did I find the correct complexities, and is this an optimal solution to this problem? Thanks!

Related

What is best: Global Variable or Parameter in this python function?

I have a question about the following code, but I guess it applies to other functions as well.
This function computes the maximum path and its length for a DAG, given the graph, a source node, and an end node.
To keep track of already computed distances across recursions I use the "max_distances_and_paths" variable, and update it on each recursion.
Is it better to keep it as a function parameter (passed in and returned across recursions) or to use a global variable initialized outside the function?
How can I avoid having this parameter returned when calling the function externally (i.e. it has to be passed along across recursions, but I don't care about its value externally)?
Is there a better way than doing LongestPath(G, source, end)[0:2]?
Thanks
# For a DAG, computes the maximum distance and the maximum path node sequence (ordered in reverse).
# Recursively computes the paths and distances to edges which are adjacent to the end node
# and selects the maximum one.
# It will return a single maximum path (and its distance) even if there are different paths
# with the same max distance.
# Input: {Node 1: adj nodes directed to Node 1, ..., Node N: adj nodes directed to Node N}
# Example: {'g': ['r'], 'k': ['g', 'r']}
def LongestPath(G, source, end, max_distances_and_paths=None):
    if max_distances_and_paths is None:
        max_distances_and_paths = {}
    max_path = [end]
    distances_list = []
    paths_list = []
    # return max_distance and max_path from source to the current "end" if already computed
    # (i.e. present in the dictionary tracking maximum distances and corresponding paths)
    if end in max_distances_and_paths:
        return max_distances_and_paths[end][0], max_distances_and_paths[end][1], max_distances_and_paths
    # base case, when the end node equals the source node
    if source == end:
        max_distance = 0
        return max_distance, max_path, max_distances_and_paths
    # if there are no adjacent nodes directed to the end node (and it is not the source node,
    # previous case), the path is disconnected
    if len(G[end]) == 0:
        return 0, [0], {"": []}
    # for each adjacent node pointing to the end node, recursively compute its max distance to
    # the source node and add one to get the distance to the end node. Recursively add nodes
    # included in the path
    for t in G[end]:
        sub_distance, sub_path, max_distances_and_paths = LongestPath(G, source, t, max_distances_and_paths)
        paths_list += [[end] + sub_path]
        distances_list += [1 + sub_distance]
    # compute max distance
    max_distance = max(distances_list)
    # access the same index where max_distance is, in the list of paths, to retrieve the path
    # corresponding to the max distance
    index = [i for i, x in enumerate(distances_list) if x == max_distance][0]
    max_path = paths_list[index]
    # update the dictionary tracking maximum distances and corresponding paths from the source
    # node to the current end node
    max_distances_and_paths.update({end: [max_distance, max_path]})
    # return the computed max distance, the corresponding path, and the tracker
    return max_distance, max_path, max_distances_and_paths
Global variables are generally avoided for several reasons (see Why are global variables evil?). I would recommend passing the parameter in this case. However, you could define a larger function housing your recursive function. Here's a quick example I wrote for a factorial:
def a(m):
    def b(m):
        if m < 1:
            return 1
        return m * b(m - 1)
    n = b(m)
    m = m + 2
    return n, m

print(a(6))
This will give: (720, 8). This shows that even if you use the same variable name in your recursive function, the one you passed in to the larger function will not change. In your case, you want to just return n, as per my example. I only returned an edited m value to show that even though both the a and b functions have m as their input, Python keeps them separate.
In general I would say avoid the use of global variables. They make your code harder to read and often more difficult to debug once your codebase gets a bit more complex, so avoiding them is good practice.
I would use a helper function to initialise your recursion.
def longest_path_helper(G, source, end, max_distances_and_paths=None):
    max_distance, max_path, max_distances_and_paths = LongestPath(
        G, source, end, max_distances_and_paths
    )
    return max_distance, max_path, max_distances_and_paths
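If the tracker is never needed outside the recursion, the helper can also simply drop it, so that external callers only get the distance and the path (a small variation on the helper above):

def longest_path_helper(G, source, end):
    max_distance, max_path, _ = LongestPath(G, source, end)
    return max_distance, max_path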
On a side note, the Python convention is to write function names in lowercase with words separated by underscores, while CapitalizedWords without underscores are used for classes. So it would be more Pythonic to use def longest_path():

Prohibitively slow execution of function compute_resilience in Python

The idea is to compute the resilience of a network presented as an undirected graph in the form
{node: (set of its neighbors) for each node in the graph}.
The function removes nodes from the graph in random order, one by one, and calculates the size of the largest remaining connected component.
The helper function bfs_visited() returns the set of nodes that are still connected to a given node.
How can I improve the implementation of the algorithm in Python 2? Preferably without changing the breadth-first algorithm in the helper function.
from collections import deque

def bfs_visited(graph, node):
    """Undirected graph {Vertex: {neighbors}}.
    Returns the set of all nodes visited by the algorithm."""
    queue = deque()
    queue.append(node)
    visited = set([node])
    while queue:
        current_node = queue.popleft()
        for neighbor in graph[current_node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return visited

def cc_visited(graph):
    """Undirected graph {Vertex: {neighbors}}.
    Returns a list of sets of connected components."""
    remaining_nodes = set(graph.keys())
    connected_components = []
    for node in remaining_nodes:
        visited = bfs_visited(graph, node)
        if visited not in connected_components:
            connected_components.append(visited)
        remaining_nodes = remaining_nodes - visited
        #print(node, remaining_nodes)
    return connected_components

def largest_cc_size(ugraph):
    """Returns the size (an integer) of the largest connected component
    in ugraph."""
    if not ugraph:
        return 0
    res = [(len(ccc), ccc) for ccc in cc_visited(ugraph)]
    res.sort()
    return res[-1][0]

def compute_resilience(ugraph, attack_order):
    """
    Input: a graph {V: N}.
    Returns a list whose (k+1)-th entry is the size of the largest cc after
    the removal of the first k nodes.
    """
    res = [len(ugraph)]
    for node in attack_order:
        neighbors = ugraph[node]
        for neighbor in neighbors:
            ugraph[neighbor].remove(node)
        ugraph.pop(node)
        res.append(largest_cc_size(ugraph))
    return res
I received this tremendously great answer from Gareth Rees, which covers the question completely.
Review
The docstring for bfs_visited should explain the node argument.
The docstring for compute_resilience should explain that the ugraph argument gets modified. Alternatively, the function could take a copy of the graph so that the original is not modified.
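For example, something along these lines (the wording is only a suggestion):

def bfs_visited(graph, node):
    """Given an undirected graph {Vertex: {neighbors}} and a starting node,
    return the set of all nodes reachable from that node."""

def compute_resilience(ugraph, attack_order):
    """Given an undirected graph {V: N} and an iterable of nodes, return a list
    whose k+1th entry is the size of the largest connected component after the
    removal of the first k nodes. Note: ugraph is modified in place."""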
In bfs_visited the lines:
queue = deque()
queue.append(node)
can be simplified to:
queue = deque([node])
The function largest_cc_size builds a list of pairs:
res = [(len(ccc), ccc) for ccc in cc_visited(ugraph)]
res.sort()
return res[-1][0]
But you can see that it only ever uses the first element of each pair (the size of the component). So you could simplify it by not building the pairs:
res = [len(ccc) for ccc in cc_visited(ugraph)]
res.sort()
return res[-1]
Since only the size of the largest component is needed, there is no need to build the whole list. Instead you could use max to find the largest:
if ugraph:
    return max(map(len, cc_visited(ugraph)))
else:
    return 0
If you are using Python 3.4 or later, this can be further simplified using the default argument to max:
return max(map(len, cc_visited(ugraph)), default=0)
This is now so simple that it probably doesn't need its own function.
This line:
remaining_nodes = set(graph.keys())
can be written more simply:
remaining_nodes = set(graph)
There is a loop over the set remaining_nodes where on each loop iteration you update remaining_nodes:
for node in remaining_nodes:
    visited = bfs_visited(graph, node)
    if visited not in connected_components:
        connected_components.append(visited)
    remaining_nodes = remaining_nodes - visited
It looks as if the intention of the code is to avoid iterating over the nodes in visited by removing them from remaining_nodes, but this doesn't work! The problem is that the for statement:
for node in remaining_nodes:
only evaluates the expression remaining_nodes once, at the start of the loop. So when the code creates a new set and assigns it to remaining_nodes:
remaining_nodes = remaining_nodes - visited
this has no effect on the nodes being iterated over.
You might imagine trying to fix this by using the difference_update method to adjust the set being iterated over:
remaining_nodes.difference_update(visited)
but this would be a bad idea because then you would be iterating over a set and modifying it within the loop, which is not safe. Instead, you need to write the loop as follows:
while remaining_nodes:
    node = remaining_nodes.pop()
    visited = bfs_visited(graph, node)
    if visited not in connected_components:
        connected_components.append(visited)
    remaining_nodes.difference_update(visited)
Using while and pop is the standard idiom in Python for consuming a data structure while modifying it — you do something similar in bfs_visited.
There is now no need for the test:
if visited not in connected_components:
since each component is produced exactly once.
In compute_resilience the first line is:
res = [len(ugraph)]
but this only works if the graph is a single connected component to start with. To handle the general case, the first line should be:
res = [largest_cc_size(ugraph)]
For each node in attack order, compute_resilience calls:
res.append(largest_cc_size(ugraph))
But this doesn't take advantage of the work that was previously done. When we remove node from the graph, all connected components remain the same, except for the connected component containing node. So we can potentially save some work if we only do a breadth-first search over that component, and not over the whole graph. (Whether this actually saves any work depends on how resilient the graph is. For highly resilient graphs it won't make much difference.)
In order to do this we'll need to redesign the data structures so that we can efficiently find the component containing a node, and efficiently remove that component from the collection of components.
This answer is already quite long, so I won't explain in detail how to redesign the data structures, I'll just present the revised code and let you figure it out for yourself.
def connected_components(graph, nodes):
    """Given an undirected graph represented as a mapping from nodes to
    the set of their neighbours, and a set of nodes, find the
    connected components in the graph containing those nodes.

    Returns:
    - mapping from nodes to the canonical node of the connected
      component they belong to
    - mapping from canonical nodes to connected components
    """
    canonical = {}
    components = {}
    while nodes:
        node = nodes.pop()
        component = bfs_visited(graph, node)
        components[node] = component
        nodes.difference_update(component)
        for n in component:
            canonical[n] = node
    return canonical, components
def resilience(graph, attack_order):
    """Given an undirected graph represented as a mapping from nodes to
    an iterable of their neighbours, and an iterable of nodes, generate
    integers such that the k-th result is the size of the largest
    connected component after the removal of the first k-1 nodes.
    """
    # Take a copy of the graph so that we can destructively modify it.
    graph = {node: set(neighbours) for node, neighbours in graph.items()}
    canonical, components = connected_components(graph, set(graph))
    largest = lambda: max(map(len, components.values()), default=0)
    yield largest()
    for node in attack_order:
        # Find the connected component containing node.
        component = components.pop(canonical.pop(node))
        # Remove node from the graph.
        for neighbor in graph[node]:
            graph[neighbor].remove(node)
        graph.pop(node)
        component.remove(node)
        # The component may have been split by the removal of node, so search
        # it for new connected components and update the data structures
        # accordingly.
        canon, comp = connected_components(graph, component)
        canonical.update(canon)
        components.update(comp)
        yield largest()
In the revised code, the max operation has to iterate over all the remaining connected components in order to find the largest one. It would be possible to improve the efficiency of this step by storing the connected components in a priority queue so that the largest one can be found in time that's logarithmic in the number of components.
I doubt that this part of the algorithm is a bottleneck in practice, so it's probably not worth the extra code, but if you need to do this, then there are some Priority Queue Implementation Notes in the Python documentation.
Performance comparison
Here's a useful function for making test cases:
from itertools import combinations
from random import random

def random_graph(n, p):
    """Return a random undirected graph with n nodes and each edge chosen
    independently with probability p.
    """
    assert 0 <= p <= 1
    graph = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if random() <= p:
            graph[i].add(j)
            graph[j].add(i)
    return graph
Now, a quick performance comparison between the revised and original code. Note that we have to run the revised code first, because the original code destructively modifies the graph, as noted in the review above.
>>> from timeit import timeit
>>> G = random_graph(300, 0.2)
>>> timeit(lambda:list(resilience(G, list(G))), number=1) # revised
0.28782312001567334
>>> timeit(lambda:compute_resilience(G, list(G)), number=1) # original
59.46968446299434
So the revised code is about 200 times faster on this test case.

Python - Networkx search predecessor nodes - Maximum depth exceeded

I'm working on a project using the library NetworkX (for graph management) in Python, and I've been having trouble trying to implement what I need.
I have a collection of directed graphs, holding special objects as nodes and weights associated with the edges. I need to go through each graph from the output nodes to the input nodes, and for each node I have to take the weights from its predecessors and an operation calculated by each predecessor node in order to build the operation for my output node. The problem is that the operations of the predecessors may depend on their own predecessors, and so on, so I'm wondering how I can solve this.
So far I have tried the following. Let's say I have a list of my output nodes, and I can go through the predecessors using the methods of the NetworkX library:
# graph is the object containing my directed graph
for node in outputNodes:
    activate_predecessors(node, graph)

# ... and a function to activate the predecessors ...
def activate_predecessors(node, graph):
    ws = []   # a list for the weights
    res = []  # a list for the responses from the predecessors
    for pred in graph.predecessors(node):
        # get the weights
        ws.append(graph[pred][node]['weight'])
        activate_predecessors(pred, graph)
        # append the response from the predecessor node to the list; this response depends on
        # its own predecessors, so the function is called recursively on the current predecessor
        res.append(pred.getResp())
    # after I have the two lists (weights and responses) the node should calculate a reduce
    # operation, done after turning those lists into numpy arrays...
    node.response = np.sum(ws * res)
This code seems to work... I tried it on some random graphs many times, but on many occasions it gives a "maximum recursion depth exceeded" error, so I need to rewrite it in a more stable (and possibly iterative) way to avoid hitting the maximum recursion depth, but I'm running out of ideas on how to handle this.
The library has some search algorithms (depth-first search), but I don't know how they could help me solve this.
I also tried to put some flags on the nodes to know if they had already been activated, but I keep getting the same error.
Edit: I forgot to mention that the input nodes have a defined response value, so they don't need to do any calculations.
Your code may run into infinite recursion if there is a cycle between two nodes. For example:
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([(1, 2), (2, 1)])

def activate_nodes(g, node):
    for pred in g.predecessors(node):
        activate_nodes(g, pred)

activate_nodes(G, 1)
RuntimeError: maximum recursion depth exceeded
If any of your graphs can contain cycles, you should either mark each node as visited or change the edges of the graph so that there are no cycles.
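Marking nodes as visited could look roughly like this (a sketch built on the minimal example above, not tested against your actual node objects):

def activate_nodes(g, node, visited=None):
    # remember which nodes have already been activated so a cycle cannot recurse forever
    if visited is None:
        visited = set()
    if node in visited:
        return
    visited.add(node)
    for pred in g.predecessors(node):
        activate_nodes(g, pred, visited)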
Assuming your graphs have no cycles, here is an example of how to implement the algorithm iteratively:
import networkx as nx

G = nx.DiGraph()
G.add_nodes_from([1, 2, 3])
G.add_edges_from([(2, 1), (3, 1), (2, 3)])
G.node[1]['weight'] = 1
G.node[2]['weight'] = 2
G.node[3]['weight'] = 3

def activate_node(g, start_node):
    stack = [start_node]
    ws = []
    while stack:
        node = stack.pop()
        preds = g.predecessors(node)
        stack += preds
        print('%s -> %s' % (node, preds))
        for pred in preds:
            ws.append(g.node[pred]['weight'])
    print('weights: %r' % ws)
    return sum(ws)

print('total sum %d' % activate_node(G, 1))
this code prints:
1 -> [2, 3]
3 -> [2]
2 -> []
2 -> []
weights: [2, 3, 2]
total sum 7
Note: you can reverse the direction of the directed graph using DiGraph.reverse(). If you need to use DFS or something else, reversing the graph turns each node's predecessors into its directly connected successors, which makes algorithms like DFS easier to apply.
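For example, a rough sketch using the same small graph as above (dfs_preorder_nodes is just one convenient traversal; any DFS over the reversed graph works):

import networkx as nx

G = nx.DiGraph()
G.add_edges_from([(2, 1), (3, 1), (2, 3)])

# In the reversed graph every predecessor becomes a successor, so a plain DFS
# from node 1 walks over all (transitive) predecessors of 1 in G.
R = G.reverse()
print(list(nx.dfs_preorder_nodes(R, 1)))  # e.g. [1, 2, 3] or [1, 3, 2]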

Finding Successors of Successors in a Directed Graph in NetworkX

I'm working on some code for a directed graph in NetworkX, and have hit a block that's likely the result of my questionable programming experience. What I'm trying to do is the following:
I have a directed graph G, with two "parent nodes" at the top, from which all other nodes flow. When graphing this network, I'd like to color every node that is a descendant of "Parent 1" one color, and all the other nodes another color. That means I need a list of Parent 1's successors.
Right now, I can get the first layer of them easily using:
descend= G.successors(parent1)
The problem is this only gives me the first generation of successors. Ideally, I want the successors of successors, the successors of those, and so on, to an arbitrary depth, because it would be extremely useful to be able to run the analysis and make the graph without having to know exactly how many generations are in it.
Any idea how to approach this?
You don't need a list of descendants, you just want to color them. For that you just have to pick an algorithm that traverses the graph and use it to color the edges.
For example, you can do
from networkx.algorithms.traversal.depth_first_search import dfs_edges
G = DiGraph( ... )
for edge in dfs_edges(G, parent1):
    color(edge)
See https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.traversal.depth_first_search.dfs_edges.html?highlight=traversal
If you want to get all the successor nodes, without passing through edges, another way could be:
import networkx as nx
G = DiGraph( ... )
successors = nx.nodes(nx.dfs_tree(G, your_node))
I noticed that if you call instead:
successors = list(nx.dfs_successors(G, your_node))
the nodes of the bottom level are somehow not included.
Well, the successors of successors are just the successors of the descendants, right?
# First successors
descend = G.successors(parent1)

# 2nd level successors
def allDescendants(d1):
    d2 = []
    for d in d1:
        d2 += G.successors(d)
    return d2

descend2 = allDescendants(descend)
To get level 3 descendants, call allDescendants(d2) etc.
Edit:
Issue 1:
allDescend = descend + descend2 gives you the two sets combined, do the same for further levels of descendants.
Issue 2: If you have loops in your graph, then you first need to modify the code to test whether you've visited that descendant before, e.g.:
def allDescendants(d1, exclude):
    d2 = []
    for d in d1:
        d2 += filter(lambda s: s not in exclude, G.successors(d))
    return d2
This way, you pass allDescend as the second argument to the above function so it's not included in future descendants. You keep doing this until allDescendants() returns an empty array, in which case you know you've explored the entire graph, and you stop.
Since this is starting to look like homework, I'll let you figure out how to piece all this together on your own. ;)
So that the answer is somewhat cleaner and easier to find for future folks who stumble upon it, here's the code I ended up using:
G = DiGraph()  # Creates an empty directed graph G
infile = open(sys.argv[1])
for edge in infile:
    edge1, edge2 = edge.split()  # Splits data on the space
    node1 = int(edge1)           # Creates integer versions of the node names
    node2 = int(edge2)
    G.add_edge(node1, node2)     # Adds an edge between the two nodes
parent1 = int(sys.argv[2])
parent2 = int(sys.argv[3])
data_successors = dfs_successors(G, parent1)
successor_list = data_successors.values()
allsuccessors = [item for sublist in successor_list for item in sublist]
pos = graphviz_layout(G, prog='dot')
plt.figure(dpi=300)
draw_networkx_nodes(G, pos, node_color="LightCoral")
draw_networkx_nodes(G, pos, nodelist=allsuccessors, node_color="SkyBlue")
draw_networkx_edges(G, pos, arrows=False)
draw_networkx_labels(G, pos, font_size=6, font_family='sans-serif', labels=labels)
I believe NetworkX has changed since @Jochen Ritzel's answer a few years ago.
Now the following holds, only changing the import statement.
import networkx
from networkx import dfs_edges
G = DiGraph( ... )
for edge in dfs_edges(G, parent1):
    color(edge)
Oneliner:
descendents = sum(nx.dfs_successors(G, parent).values(), [])
nx.descendants(G, parent)
more details: https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.dag.descendants.html

Reprioritizing priority queue (efficient manner)

I'm looking for a more efficient way to reprioritize items in a priority queue. I have a (quite naive) priority queue implementation based on heapq. The relevant parts are like:
from heapq import heapify, heappop

class pq(object):
    def __init__(self, init=None):
        self.inner, self.item_f = [], {}
        if init is not None:
            self.inner = [[priority, item] for item, priority in enumerate(init)]
            heapify(self.inner)
            self.item_f = {pi[1]: pi for pi in self.inner}

    def top_one(self):
        if not len(self.inner):
            return None
        priority, item = heappop(self.inner)
        del self.item_f[item]
        return item, priority

    def re_prioritize(self, items, prioritizer=lambda x: x + 1):
        for item in items:
            if item not in self.item_f:
                continue
            entry = self.item_f[item]
            entry[0] = prioritizer(entry[0])
        heapify(self.inner)
And here is a simple coroutine just to demonstrate the reprioritization characteristics of my real application.
def fecther(priorities, prioritizer=lambda x: x + 1):
    q = pq(priorities)
    for k in xrange(len(priorities) + 1):
        items = (yield k, q.top_one())
        if items is not None:
            q.re_prioritize(items, prioritizer)
With testing
if __name__ == '__main__':
    def gen_tst(n=3):
        priorities = range(n)
        priorities.reverse()
        priorities = priorities + range(n)

        def tst():
            result, f = range(2 * n), fecther(priorities)
            k, item_t = f.next()
            while item_t is not None:
                result[k] = item_t[0]
                k, item_t = f.send(range(item_t[0]))
            return result
        return tst
producing:
In []: gen_tst()()
Out[]: [2, 3, 4, 5, 1, 0]
In []: t= gen_tst(123)
In []: %timeit t()
10 loops, best of 3: 26 ms per loop
Now, my question is: does there exist any data structure which would avoid the calls to heapify(.) when reprioritizing the priority queue? I'm willing to trade memory for speed here, but it should be possible to implement it in pure Python (obviously with much better timings than my naive implementation).
Update:
In order to give a better idea of the specific case, let's assume that no items are added to the queue after the initial (batch) pushes, and that every fetch (pop) from the queue generates a number of reprioritizations roughly following this scheme:
0 * n, very seldom
0.05 * n, typically
n, very seldom
where n is the current number of items in the queue. Thus, in any round, there are relatively few items to reprioritize. So I'm hoping that there could exist a data structure able to exploit this pattern and therefore outperform the cost of doing a mandatory heapify(.) in every round (in order to satisfy the heap invariant).
Update 2:
So far it seems that the heapify(.) approach is actually quite efficient (relatively speaking). All the alternatives I have been able to figure out need to use heappush(.), and that seems to be more expensive than I originally anticipated. (Anyway, if the state of the issue remains like this, I'll be forced to find a better solution outside the Python realm.)
Since the new prioritization function may have no relationship to the previous one, you have to pay the cost to get the new ordering (and it's at minimum O(n) just to find the minimum element in the new ordering). If you have a small, fixed number of prioritization functions and switch frequently between them, then you could benefit from keeping a separate heap going for each function (although not with heapq, because it doesn't support cheaply locating and removing an object from the middle of a heap).
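If the number of reprioritizations per round really is small (as in the updates above), one more option worth mentioning is the lazy-invalidation idiom described in the Priority Queue Implementation Notes of the heapq documentation: instead of re-heapifying, mark the old entry as removed and push a fresh entry with the new priority, skipping stale entries on pop. Reprioritizing k items then costs O(k log n) rather than a full O(n) heapify. A rough sketch, not a drop-in replacement for the pq class above (the class and method names here are made up for illustration):

from heapq import heappush, heappop
from itertools import count

REMOVED = object()  # sentinel marking an invalidated heap entry

class LazyPQ(object):
    def __init__(self):
        self.heap = []          # entries are [priority, tie_break, item] lists
        self.entry = {}         # item -> its current (live) entry
        self.counter = count()  # tie-breaker so items themselves are never compared

    def push(self, item, priority):
        # reprioritizing is just: invalidate the old entry and push a new one
        if item in self.entry:
            self.entry[item][2] = REMOVED
        e = [priority, next(self.counter), item]
        self.entry[item] = e
        heappush(self.heap, e)

    def pop(self):
        # skip entries that were invalidated by a later push of the same item
        while self.heap:
            priority, _, item = heappop(self.heap)
            if item is not REMOVED:
                del self.entry[item]
                return item, priority
        return None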
