How to Analyze DAG Time Complexity? - python

I am learning about topological sort, and graphs in general. I implemented a version below using DFS, but I am having trouble understanding why the Wikipedia page says it is O(|V|+|E|), how to analyze its time complexity, and the difference between |V|+|E| and n^2 in general.
Firstly, I have two for loops; logic says that would be O(n^2). But isn't it also true that in any DAG (or tree) there are n-1 edges and n vertices? How is this any different from n^2 if we can drop the "-1" as insignificant?
graph = {
    1: [4, 5, 7],
    2: [3, 5, 6],
    3: [4],
    4: [5],
    5: [6, 7],
    6: [7],
    7: []
}
from collections import defaultdict

def topological_sort(graph):
    ordered, marked = [], defaultdict(int)
    while len(ordered) < len(graph):
        for vertex in graph:
            if marked[vertex] == 0:
                visit(graph, vertex, ordered, marked)
    return ordered

def visit(graph, n, ordered, marked):
    if marked[n] == 1:
        raise ValueError('Not a DAG')
    marked[n] = 1
    for neighbor in graph.get(n):
        if marked[neighbor] != 2:
            visit(graph, neighbor, ordered, marked)
    marked[n] = 2
    ordered.insert(0, n)

def main():
    print(topological_sort(graph))

main()

A proper implementation runs in O(|V| + |E|) time because it visits every edge and every vertex at most once. That is the same as O(|V|^2) for a complete (or almost complete) graph, but it is much better when the graph is sparse.
Your implementation is O(|V|^2), not O(|V| + |E|). These two nested loops:
while len(ordered) < len(graph):
    for vertex in graph:
        if marked[vertex] == 0:
            visit(graph, vertex, ordered, marked)
perform 1 + 2 + ... + |V| = O(|V|^2) iterations in the worst case (for instance, for a graph with no edges). You can easily fix this by getting rid of the outer loop. It's that simple: just remove the while loop; you don't need it.
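To make the fix concrete, here is a minimal sketch of the corrected version, assuming the same three-state marking (0 = unvisited, 1 = in progress, 2 = done); the string raise is also replaced with a ValueError, since raising a bare string is not valid Python:

```python
from collections import defaultdict

graph = {
    1: [4, 5, 7],
    2: [3, 5, 6],
    3: [4],
    4: [5],
    5: [6, 7],
    6: [7],
    7: []
}

def topological_sort(graph):
    ordered, marked = [], defaultdict(int)
    # One pass over the vertices; each vertex and each edge is
    # processed at most once inside visit(), giving O(|V| + |E|).
    for vertex in graph:
        if marked[vertex] == 0:
            visit(graph, vertex, ordered, marked)
    return ordered

def visit(graph, n, ordered, marked):
    if marked[n] == 1:       # back edge found while n is "in progress"
        raise ValueError('Not a DAG')
    if marked[n] == 2:       # already placed in the order
        return
    marked[n] = 1
    for neighbor in graph[n]:
        visit(graph, neighbor, ordered, marked)
    marked[n] = 2
    ordered.insert(0, n)
```

With the while loop gone, each call to visit() either returns immediately (already done) or marks a new vertex, so the total work is proportional to |V| plus the number of edges examined.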

Related

Dijkstra algorithm not working even though passes the sample test cases

So I have followed Wikipedia's pseudocode for Dijkstra's algorithm as well as Brilliant's. https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm#Pseudocode https://brilliant.org/wiki/dijkstras-short-path-finder/. Here is my code, which doesn't work. Can anyone point out the flaw in my code?
# Uses python3
from queue import Queue

n, m = map(int, input().split())
adj = [[] for i in range(n)]
for i in range(m):
    u, v, w = map(int, input().split())
    adj[u-1].append([v, w])
    adj[v-1].append([u, w])
x, y = map(int, input().split())
x, y = x-1, y-1
q = [i for i in range(n, 0, -1)]
# visited = set()
# visited.add(x+1)
dist = [float('inf') for i in range(len(adj))]
dist[x] = 0
# print(adj[visiting])
while len(q) != 0:
    visiting = q.pop()-1
    for i in adj[visiting]:
        u, v = i
        dist[u-1] = dist[visiting]+v if dist[visiting] + \
            v < dist[u-1] else dist[u-1]
# print(dist)
if dist[y] != float('inf'):
    print(dist[y])
else:
    print(-1)
Your algorithm is not implementing Dijkstra's algorithm correctly. You are just iterating over all nodes in their input order and updating the distance to the neighbors based on the node's current distance. But that latter distance is not guaranteed to be the shortest distance, because you iterate some nodes before their "turn". Dijkstra's algorithm specifies a particular order of processing nodes, which is not necessarily the input order.
The main ingredient that is missing from your algorithm is a priority queue. You did import Queue, but never use it. It also lacks the marking of nodes as visited, a concept which you seem to have started to implement but then commented out.
The outline of the algorithm on Wikipedia explains the use of this priority queue in the last step of each iteration:
Otherwise, select the unvisited node that is marked with the smallest tentative distance, set it as the new "current node", and go back to step 3.
There is currently no mechanism in your code that selects the unvisited node with the smallest tentative distance. Instead, it picks the next node based on the order in the input.
To correct your code, please consult the pseudocode that is available on that same Wikipedia page; I would advise going for the variant with a priority queue.
In Python you can use heapq for performing the actions on the priority queue (heappush, heappop).
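As a sketch of what that heapq-based variant can look like (the adjacency-list shape here is an assumption for illustration: a 0-indexed list where adj[u] holds (neighbor, weight) pairs, not the asker's 1-based input format):

```python
import heapq

def dijkstra(adj, source):
    """Shortest distances from source over an adjacency list where
    adj[u] = [(v, w), ...]. A minimal sketch, not a drop-in fix."""
    dist = [float('inf')] * len(adj)
    dist[source] = 0
    heap = [(0, source)]                 # (tentative distance, node)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:                  # stale entry: u was settled earlier
            continue
        for v, w in adj[u]:
            if d + w < dist[v]:          # relax the edge u -> v
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist

# Example: 0 -> 1 (4), 0 -> 2 (1), 2 -> 1 (2), 1 -> 3 (1), 2 -> 3 (5)
adj = [[(1, 4), (2, 1)], [(3, 1)], [(1, 2), (3, 5)], []]
# dijkstra(adj, 0) == [0, 3, 1, 4]
```

Instead of a separate visited set, this version skips stale heap entries (d > dist[u]), which is a common way to emulate the "decrease-key" step with heapq.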

Raising performance of BFS in Python

How can I increase speed performance of below Python code?
My code works okay which means no errors but the performance of this code is very slow.
The input data is Facebook Large Page-Page Network dataset, you can access here the dataset: (http://snap.stanford.edu/data/facebook-large-page-page-network.html)
Problem definition:
Check if the distance between two nodes is less than max_distance
My constraints:
I have to import a .txt file of which format is like sample_input
Expected output is like
sample_output
Total code runtime should be less than 5 secs.
Can anyone give me advice on how to improve my code? Here it is:
from collections import deque

class Graph:
    def __init__(self, filename):
        self.filename = filename
        self.graph = {}
        with open(self.filename) as input_data:
            for line in input_data:
                key, val = line.strip().split(',')
                self.graph[key] = self.graph.get(key, []) + [val]

    def check_distance(self, x, y, max_distance):
        dist = self.path(x, y, max_distance)
        if dist:
            return dist - 1 <= max_distance
        else:
            return False

    def path(self, x, y, max_distance):
        start, end = str(x), str(y)
        queue = deque([start])
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node == end:
                return len(path)
            elif len(path) > max_distance:
                return False
            else:
                for adjacent in self.graph.get(node, []):
                    queue.append(list(path) + [adjacent])
Thank you for your help in advance.
Several pointers:
if you call check_distance more than once, you have to recreate the graph
calling queue.pop(0) is inefficient on a standard list in Python; use something like a deque from the collections module (see here)
as DarrylG points out, you can exit from the BFS early once a path has exceeded the max distance
you could try
from collections import deque

class Graph:
    def __init__(self, filename):
        self.filename = filename
        self.graph = self.file_to_graph()

    def file_to_graph(self):
        graph = {}
        with open(self.filename) as input_data:
            for line in input_data:
                key, val = line.strip().split(',')
                graph[key] = graph.get(key, []) + [val]
        return graph

    def check_distance(self, x, y, max_distance):
        path_length = self.path(x, y, max_distance)
        if path_length:
            return path_length - 1 <= max_distance
        else:
            return False

    def path(self, x, y, max_distance):
        start, end = str(x), str(y)
        queue = deque([[start]])  # store each path as a list of nodes
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node == end:
                return len(path)
            elif len(path) > max_distance:
                # we have explored all paths shorter than the max distance
                return False
            else:
                for adjacent in self.graph.get(node, []):
                    queue.append(path + [adjacent])
As to why pop(0) is inefficient - from the docs:
Though list objects support similar operations, they are optimized for fast fixed-length operations and incur O(n) memory movement costs for pop(0) and insert(0, v) operations which change both the size and position of the underlying data representation.
About the approach:
You create a graph and then run several pairwise comparisons between elements of it, executing your BFS algorithm every time. Each run costs O(|V|+|E|), so you recompute the same distances again and again. This is not a good approach.
What I recommend: run Dijkstra's algorithm once, store the minimum distances between nodes in an adjacency matrix, and from then on only read the precomputed distances out of that matrix instead of recalculating them for every query.
About the algorithms:
I recommend you look at different approaches to DFS/BFS.
If you are comparing all pairs of nodes, I believe Dijkstra's algorithm will be more efficient in your case because it marks visited paths (https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm). You can modify it and call it only once.
Another thing to check: does your graph contain cycles? If yes, you need some control over cycles; you may want to read about the Ford-Fulkerson algorithm (https://en.wikipedia.org/wiki/Ford%E2%80%93Fulkerson_algorithm).
As I understand it, every time you want to compare one node to another, you run your algorithm again. If you have 1000 elements in your graph, every comparison may visit the other 999 nodes just to perform this check.
If you implement Dijkstra and store the distances, you run it only once for your entire network and keep the distances in memory.
The next step is to collect the distances from memory and put them in an array.
You can save all distances in an adjacency matrix (http://opendatastructures.org/versions/edition-0.1e/ods-java/12_1_AdjacencyMatrix_Repres.html) and simply consume that information as many times as needed, without paying the calculation cost every time.
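Since the edges in this dataset are unweighted, a plain single-source BFS already yields shortest hop counts, so the "compute once, reuse many times" idea above can be sketched without full Dijkstra (the function name bfs_distances is illustrative, and this assumes the graph dict maps each node to a list of neighbors, as in the question):

```python
from collections import deque

def bfs_distances(graph, start):
    """Hop count from start to every reachable node in an unweighted
    graph given as a dict: node -> list of neighbors."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for adjacent in graph.get(node, []):
            if adjacent not in dist:          # first visit = shortest path
                dist[adjacent] = dist[node] + 1
                queue.append(adjacent)
    return dist

# Precompute once per source of interest, then answer queries from memory:
# dist_from = {src: bfs_distances(graph, src) for src in sources_of_interest}
```

Each dictionary of distances is computed in O(|V|+|E|) once; afterwards a distance lookup is O(1), which is the point the answer above makes about storing results instead of re-running the search.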

Fastest way of checking if a subgraph is a clique in NetworkX

I want to find out if a given subgraph of G is a complete graph. I was expecting to find a built in function, like is_complete_graph(G), but I can't see anything like that.
My current solution is to create a new helper function:
def is_complete(G):
    n = G.order()
    return n*(n-1)/2 == G.size()
I imagine this is probably fast, but it feels wrong to implement this kind of thing myself, and I feel there must be a 'right' way to do it in NetworkX.
I only need a solution for simple undirected graphs.
edit
The answer at the bottom is relatively clean. However, it appears that the following is faster:
def is_subclique(G, nodelist):
    H = G.subgraph(nodelist)
    n = len(nodelist)
    return H.size() == n*(n-1)/2
I have to admit, I don't entirely understand. But clearly creating the subgraph is faster than checking whether each edge exists.
Slower alternative I expected to be faster:
We'll check to see if all of the edges are there. We'll use combinations to generate the pairs we check. Note that if combinations returns (1,2), then it will not return (2,1).
from itertools import combinations
import networkx as nx

def is_subclique(G, nodelist):
    r'''For each pair of nodes in nodelist, check whether there is an edge.
    If any edge is missing, we know that it's not a subclique.
    If all edges are there, it is a subclique.
    '''
    for (u, v) in combinations(nodelist, 2):  # check each possible pair
        if not G.has_edge(u, v):
            return False  # if any edge is missing we're done
    return True  # if we get here, then every edge was there. It's True.

G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (3, 1), (4, 1)])

is_subclique(G, [1, 2, 3])
> True
is_subclique(G, [1, 4])
> True
is_subclique(G, [2, 3, 4])
> False
You actually need to check more than the number of edges, because self-loops aren't allowed in a complete graph. They also potentially change the expected edge count.
Here's a very fast function -- especially if it isn't a complete graph. Notice that it avoids even counting the edges. It just makes sure each node has the right neighbors.
def is_complete_graph(G):
    N = len(G) - 1
    return not any(n in nbrdict or len(nbrdict) != N for n, nbrdict in G.adj.items())
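To see the logic of that check without a NetworkX dependency, here is a sketch of the same neighbor test on a plain adjacency dict standing in for G.adj (the dict-of-sets shape is an assumption for illustration):

```python
def is_complete(adj):
    """adj: dict mapping node -> set of neighbors. Mirrors the G.adj
    check above: every node must neighbor exactly the other N-1 nodes,
    and must not neighbor itself (no self-loops)."""
    N = len(adj) - 1
    return not any(n in nbrs or len(nbrs) != N for n, nbrs in adj.items())

triangle = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}   # complete on 3 nodes
path = {1: {2}, 2: {1, 3}, 3: {2}}             # missing edge 1-3
```

Because any() short-circuits, a graph that is far from complete is rejected as soon as the first deficient node is seen, without counting all edges, which is why the answer above notes it is especially fast on non-complete graphs.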

Kruskal's MST algorithm in Python seems to be not quite right

I was implementing Kruskal's algorithm in Python, but it isn't giving correct answers for dense graphs. Is there any flaw in the logic?
The algorithm I am using is this:
1) Store all vertices in visited
2) Sort all the edges by their weights
3) Keep picking the smallest edge until every vertex v has visited[v] = 1
This is what I have tried:
from collections import deque

def Kruskal(weighted_graph):
    visited = {}
    for u, v, _ in weighted_graph:
        visited[u] = 0
        visited[v] = 0
    sorted_edges = deque(sorted(weighted_graph, key=lambda x: x[2]))
    mstlist = []
    sumi = 0
    while 0 in visited.values():
        u, v, w = sorted_edges.popleft()
        if visited[u] == 0 or visited[v] == 0:
            mstlist.append((u, v))
            visited[u] = 1
            visited[v] = 1
            sumi += w
    return (sumi, mstlist)
The input is a list of tuples; a single tuple looks like this: (source, neighbor, weight).
The minimum spanning tree sum which I am calculating comes out wrong for dense graphs. Please help. Thank you!
Your condition for adding the edge is if visited[u] == 0 or visited[v] == 0, so you require that one of the adjacent nodes is not connected to any edge you have added to your MST so far. For the algorithm to work correctly, however, it is sometimes necessary to add edges even if you have already "visited" both nodes. Consider this very simple graph:
[
    (A, B, 2),
    (B, C, 3),
    (C, D, 1),
]
visual representation:
[A]---(2)---[B]---(3)---[C]---(1)---[D]
Your algorithm would first add the edge (C, D), marking C and D as visited.
Then, it would add the edge (A, B), marking A and B as visited.
Now, you're only left with the edge (B, C). The MST for this graph obviously contains this edge. But your condition fails -- both B and C are marked as visited. So, your algorithm doesn't add that edge.
In conclusion, you need to replace that check. You should check whether the two nodes that the current edge connects are already connected by the edges that you have added to your MST so far. If they are, you skip it, otherwise you add it.
Usually, disjoint-set data structures are used for implementing this with a good run time complexity (see the pseudocode on wikipedia).
However, your code so far already has bad run time complexity as 0 in visited.values() has to linearly search through the values of the dictionary until it either reaches the end or finds an element with value 0, so it might be enough for you to do something simpler.
You can find some implementations of the algorithm using disjoint-set data structures on the internet, e.g. here.
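A minimal sketch of Kruskal's algorithm with a disjoint-set structure replacing the visited check might look like this (the names are illustrative, and the union-find here uses simple path halving without union by rank, which is enough to show the idea):

```python
def kruskal(edges):
    """Minimum spanning forest via Kruskal's algorithm.
    edges: list of (u, v, weight) tuples. Returns (total_weight, mst_edges)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)          # lazily register new nodes
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    total, mst = 0, []
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:                     # u and v not yet connected: keep edge
            parent[ru] = rv              # union the two components
            mst.append((u, v))
            total += w
    return total, mst
```

On the path graph from the answer above, kruskal([('A', 'B', 2), ('B', 'C', 3), ('C', 'D', 1)]) correctly keeps all three edges (total weight 6), including (B, C) even though both endpoints were already "visited".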

Creating lists of mutual neighbor elements

Say, I have a set of unique, discrete parameter values, stored in a variable 'para'.
para=[1,2,3,4,5,6,7,8,9,10]
Each element in this list has 'K' neighbors (given: each neighbor ∈ para).
EDIT: This 'K' is obviously not the same for each element.
And to clarify the actual size of my problem: I need a neighborhood of close to 50-100 neighbors on average, given that my para list is around 1000 elements large.
NOTE: A neighbor of an element, is another possible 'element value' to which it can jump, by a single mutation.
neighbors_of_1 = [2,4,5,9] #contains all possible neighbors of 1 (i.e para[0])
Question: How can I define each of the other element's
neighbors randomly from 'para', but, keeping in mind the previously
assigned neighbors/relations?
eg:
neighbors_of_5=[1,3,7,10] #contains all possible neighbors of 5 (i.e para[4])
NOTE: '1' has been assigned as a neighbor of '5', keeping the values of 'neighbors_of_1' in mind. They are 'mutual' neighbors.
I know the inefficient way of doing this would be to keep looping through the previously assigned lists and check whether the current state is a neighbor of another state, and if True, store that state's value as one of the new neighbors.
Is there a cleaner/more pythonic way of doing this? (By maybe using the concept of linked-lists or any other method? Or are lists redundant?)
This solution does what you want, I believe. It is not the most efficient, as it generates quite a bit of extra elements and data, but the run time was still short on my computer and I assume you won't run this repeatedly in a tight, inner loop?
import itertools as itools
import random

# Generating a random para variable:
# para = [1,2,3,4,5,6,7,8,9,10]
para = list(range(10000))
random.shuffle(para)
para = para[:1000]

# Generate all pairs in para (in random order)
pairs = [(a, b) for a, b in itools.product(para, para) if a < b]
random.shuffle(pairs)

K = 50  # average number of neighbors
N = len(para)*K//2  # total connections

# Generating a neighbors dict, holding all the neighbors of an element
neighbors = dict()
for elem in para:
    neighbors[elem] = []

# append the neighbors to each other
for pair in pairs[:N]:
    neighbors[pair[0]].append(pair[1])
    neighbors[pair[1]].append(pair[0])

# sort each neighbor list
for neighbor in neighbors.values():
    neighbor.sort()
I hope you understand my solution. Otherwise feel free to ask for a few pointers.
Neighborhood can be represented by a graph. If "A is a neighbor of B" does not necessarily imply that B is a neighbor of A, the graph is directed; otherwise it is undirected. I'm guessing you want an undirected graph, since you want to "keep in mind the relationship between the nodes".
Besides the obvious choice of using a third-party graph library, you can solve your issue with a set of edges between the graph vertices. An edge can be represented by the pair of its two endpoints. Since edges are undirected, either use a tuple (A, B) such that A < B, or use a frozenset((A, B)).
Note there are considerations about which neighbor to randomly choose in the middle of the algorithm, such as discouraging picking nodes that already have many neighbors, to avoid going over your limits.
Here is a pseudo-code of what I'd do.
edges = set()
arities = [0 for p in para]
for i in range(len(para)):
    p = para[i]
    arity = arities[i]
    n = random.randrange(50, 100)
    k = n
    while k > 0:
        w = list(map(lambda x: 1/x, arities))
        # note: test what random scheme suits you best
        j = random.choices(para, weights=w)
        # note: I'm storing the vertex indices in the edges rather than the nodes.
        # But if the nodes are unique, you could store the nodes.
        e = frozenset((i, j))
        if e not in edges:
            edges.add(e)
            # instead of arities, you could keep a list of lists of the neighbours;
            # arity[i] would then be len(neighbors[i])
            arities[i] += 1
            arities[j] += 1
            k -= 1
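A runnable sketch of this edge-set idea (the name random_mutual_neighbors is illustrative, and it samples edges uniformly rather than with the arity weighting discussed in the pseudocode above):

```python
import random

def random_mutual_neighbors(para, k_avg, seed=None):
    """Assign each element roughly k_avg mutual neighbors by sampling
    undirected edges. Edges are frozensets, so (a, b) and (b, a) are the
    same edge and neighborhood is automatically mutual."""
    rng = random.Random(seed)
    edges = set()
    neighbors = {p: [] for p in para}
    target = len(para) * k_avg // 2      # each edge serves two elements
    while len(edges) < target:
        a, b = rng.sample(para, 2)       # two distinct elements
        e = frozenset((a, b))
        if e not in edges:               # skip duplicate edges
            edges.add(e)
            neighbors[a].append(b)
            neighbors[b].append(a)
    return neighbors
```

Because every edge updates both endpoints' lists, a relation like "1 is a neighbor of 5" automatically appears in both neighbors[1] and neighbors[5], which is exactly the mutuality the question asks for.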
