Raising performance of BFS in Python

Raising performance of BFS in Python - python

How can I increase speed performance of below Python code?
My code works okay which means no errors but the performance of this code is very slow.
The input data is Facebook Large Page-Page Network dataset, you can access here the dataset: (http://snap.stanford.edu/data/facebook-large-page-page-network.html)
Problem definition:
Check if the distance between two nodes are less than max_distance
My constraints:
I have to import a .txt file of which format is like sample_input
Expected ouput is like
sample_output
Totall code runtime should be less than 5 secs.
Can anyone give me an advice to improve my code much better? Follow my code:
from collections import deque
class Graph:
def __init__(self, filename):
self.filename = filename
self.graph = {}
with open(self.filename) as input_data:
for line in input_data:
key, val = line.strip().split(',')
self.graph[key] = self.graph.get(key, []) + [val]
def check_distance(self, x, y, max_distance):
dist = self.path(x, y, max_distance)
if dist:
return dist - 1 <= max_distance
else:
return False
def path(self, x, y, max_distance):
start, end = str(x), str(y)
queue = deque([start])
while queue:
path = queue.popleft()
node = path[-1]
if node == end:
return len(path)
elif len(path) > max_distance:
return False
else:
for adjacent in self.graph.get(node, []):
queue.append(list(path) + [adjacent])
Thank you for your help in advance.

Several pointers:
if you call check distance more than once you have to recreate the graph
calling queue.pop(0) is inefficient on a standard list in python, use something like a deque from the collections module. see here
as DarrylG points out you can exit from the BFS early once a path has exceed the max distance
you could try
from collections import deque
class Graph:
def __init__(self, filename):
self.filename = filename
self.graph = self.file_to_graph()
def file_to_graph(self):
graph = {}
with open(self.filename) as input_data:
for line in input_data:
key, val = line.strip().split(',')
graph[key] = graph.get(key, []) + [val]
return graph
def check_distance(self, x, y, max_distance):
path_length = self.path(x, y, max_distance)
if path_length:
return len(path) - 1 <= max_distance
else:
return False
def path(self, x, y, max_distance):
start, end = str(x), str(y)
queue = deque([start])
while queue:
path = queue.popleft()
node = path[-1]
if node == end:
return len(path)
elif len(path) > max_distance:
# we have explored all paths shorter than the max distance
return False
else:
for adjacent in self.graph.get(node, []):
queue.append(list(path) + [adjacent])
As to why pop(0) is inefficient - from the docs:
Though list objects support similar operations, they are optimized for fast fixed-length operations and incur O(n) memory movement costs for pop(0) and insert(0, v) operations which change both the size and position of the underlying data representation.

About the approach:
You create a graph and execute several times comparison from one to another element in your graph. Every time you run your BFS algorithm. This will create a cost of O|E+V| at every time or, you need to calculate every time the distances again and again. This is not a good approach.
What I recommend. Run a Dijkstra Algorithm (that get the minimum distance between 2 nodes and store the info on an adjacency matrix. What you will need to do is only get the calculated info inside this adjacency matrix that will contains all minimal distances on your graph and what you will need is consume the distances calculated on a previous step
About the algorithms
I recommend you to look for different approaches of DFS/BFS.
If you're looking to compare all nodes I believe that Dijkstra's Algorithm will be more efficient in your case because they mark visited paths.(https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm). You can modify and call the algo only once.
Other thing that you need to check is. Your graph contains cycles? If yes, you need to apply some control on cycles, you'll need to check about Ford Fulkerson Algorithm (https://en.wikipedia.org/wiki/Ford%E2%80%93Fulkerson_algorithm)
As I understood. Every time that you want to compare a node to another, you run again your algorithm. If you have 1000 elements on your graph, your comparison, at every time will visit 999 nodes to check this.
If you implement a Dijkstra and store the distances, you run only once for your entire network and save the distances in memory.
The next step is collect from memory the distances that you can put in an array.
You can save all distances on an adjacency matrix (http://opendatastructures.org/versions/edition-0.1e/ods-java/12_1_AdjacencyMatrix_Repres.html) and only consume the information several times without the calculation debt at every time.

Related

Dijkstra algorithm not working even though passes the sample test cases

So I have followed Wikipedia's pseudocode for Dijkstra's algorithm as well as Brilliants. https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm#Pseudocode https://brilliant.org/wiki/dijkstras-short-path-finder/. Here is my code which doesn't work. Can anyone point in the flaw in my code?
# Uses python3
from queue import Queue
n, m = map(int, input().split())
adj = [[] for i in range(n)]
for i in range(m):
u, v, w = map(int, input().split())
adj[u-1].append([v, w])
adj[v-1].append([u, w])
x, y = map(int, input().split())
x, y = x-1, y-1
q = [i for i in range(n, 0, -1)]
#visited = set()
# visited.add(x+1)
dist = [float('inf') for i in range(len(adj))]
dist[x] = 0
# print(adj[visiting])
while len(q) != 0:
visiting = q.pop()-1
for i in adj[visiting]:
u, v = i
dist[u-1] = dist[visiting]+v if dist[visiting] + \
v < dist[u-1] else dist[u-1]
# print(dist)
if dist[y] != float('inf'):
print(dist[y])
else:
print(-1)

Your algorithm is not implementing Dijkstra's algorithm correctly. You are just iterating over all nodes in their input order and updating the distance to the neighbors based on the node's current distance. But that latter distance is not guaranteed to be the shortest distance, because you iterate some nodes before their "turn". Dijkstra's algorithm specifies a particular order of processing nodes, which is not necessarily the input order.
The main ingredient that is missing from your algorithm, is a priority queue. You did import from Queue, but never use it. Also, it lacks the marking of nodes as visited, a concept which you seemed to have implemented a bit, but which you commented out.
The outline of the algorithm on Wikipedia explains the use of this priority queue in the last step of each iteration:
Otherwise, select the unvisited node that is marked with the smallest tentative distance, set it as the new "current node", and go back to step 3.
There is currently no mechanism in your code that selects the visited node with smallest distance. Instead it picks the next node based on the order in the input.
To correct your code, please consult the pseudo code that is available on that same Wikipedia page, and I would advise to go for the variant with priority queue.
In Python you can use heapq for performing the actions on the priority queue (heappush, heappop).

How to Analyze DAG Time Complexity?

I am learning about topological sort, and graphs in general. I implemented a version below using DFS but I am having trouble understanding why the wikipedia page says this is O(|V|+|E|) and analyzing its time complexity, and the difference between |V|+|E| and n^2 in general.
Firstly, I have two for loops, logic says that it would be (n^2) but also isnt it true that in any DAG(or Tree), there is n-1 edges, and n vertexes? How is this any different from n^2 if we can remove the "-1" for non significant value?
graph = {
1:[4, 5, 7],
2:[3,5,6],
3:[4],
4:[5],
5:[6,7],
6:[7],
7:[]
}
from collections import defaultdict
def topological_sort(graph):
ordered, marked = [], defaultdict(int)
while len(ordered) < len(graph):
for vertex in graph:
if marked[vertex]==0:
visit(graph, vertex, ordered, marked)
return ordered
def visit(graph, n, ordered, marked):
if marked[n] == 1:
raise 'Not a DAG'
marked[n] = 1
for neighbor in graph.get(n):
if marked[neighbor]!=2:
visit(graph, neighbor, ordered, marked)
marked[n] = 2
ordered.insert(0, n)
def main():
print(topological_sort(graph))
main()

The proper implementation works in O(|V| + |E|) time because it goes through every edge and every vertex at most once. It's the same thing as O(|V|^2) for a complete (or almost complete graph). However, it's much better when the graph is sparse.
You implementation is O(|V|^2), not O(|V| + |E|). These two nested loops:
while len(ordered) < len(graph):
for vertex in graph:
if marked[vertex]==0:
visit(graph, vertex, ordered, marked)
do 1 + 2 ... + |V| = O(|V|^2) iterations in the worst case (for instance, for an empty graph). You can easily fix by getting rid of the outer loop (it's that simple: just remove the while loop. You don't need it).

Dijkstra's Algorithm With Large Weights

I am trying to solve a question on an online judge about calculating all shortest paths on a complete graph. Full problem specifications can be seen here. However, I am exceeding the memory limit required. Here is the part of the code that does Dijkstra's algorithm:
n = int(raw_input())
dict1 = [[""for i in xrange(n+1)]for j in xrange(n+1)]
edges = [0]
for i in xrange(n):
x,y = map(int, raw_input().split())
edges.append((x,y))
for i,coord1 in enumerate(edges):
for j,coord2 in enumerate(edges):
if i==j or i==0 or j==0:
continue
x1,y1 = coord1
x2,y2 = coord2
weight = (x1-x2)*(x1-x2) + (y1-y2)*(y1-y2)
dict1[i][j] = weight
dict1[j][i] = weight
x = int(raw_input())
times = []
vertices = {i:1e13 if i!= x else 0 for i in xrange(1,n+1)}
while len(vertices)>0:
minimum = min(vertices.items(), key=lambda x: x[1])[0]
currentCost = vertices[minimum]
times.append(currentCost)
del vertices[minimum]
for neighbour,newWeight in enumerate(dict1[minimum]):
if neighbour in vertices and newWeight != "":
if currentCost + newWeight < vertices[neighbour]:
vertices[neighbour] = currentCost + newWeight
The code uses the original algorithm without the priority queue because of the better time complexity. Even though this gives the right answer, I have a feeling the memory exceeding has something to do with the way I am storing the weights, considering they can be as large as 10^12. Is there another way I can store the weights that will use less memory, or is something else causing the problem?

Your problem has nothing to do with big weights (10^12 is not a big number). If you want to see that this is the case (try dividing them by some number like 1000 to see that it will fail as well).
The problem is that you do not use priority queue and this deteriorate the time complexity to O(V^2) and if you will use a priority queue, you will get O(E + V log(V)).
So implement a normal Dijkstra and will get your answer accepted.
Sorry, have not read that this is a planar graph and that it is dense. Knowing that your graph consists of 2d points, you can take advantage of the distance heuristics and use A* algorithm.

Too slow queries in interval tree

I have a list of intervals and I need to return the ones that overlap with an interval passed in a query. What is special is that in a typical query around a third or even half of the intervals will overlap with the one given in the query. Also, the ratio of the shortest interval to the longest is not more than 1:5. I implemented my own interval tree (augmented red-black tree) - I did not want to use existing implementations because I needed support for closed intervals and some special features. I tested the query speed with 6000 queries in a tree with 6000 intervals (so n=6000 and m=3000 (app.)). It turned out that brute force is just as good as using the tree:
Computation time - loop: 125.220461 s
Tree setup: 0.05064 s
Tree Queries: 123.167337 s
Let me use asymptotic analysis. n: number of queries; n: number of intervals; app. n/2: number of intervals returned in a query:
time complexity brute force: n*n
time complexity tree: n*(log(n)+n/2) --> 1/2 nn + nlog(n) --> n*n
So the result is saying that the two should be roughly the same for a large n. Still one would somehow expect the tree to be noticeably faster given the constant 1/2 in front of n*n. So there are three possible reasons I can imagine for the results I got:
a) My implementation is wrong. (Should I be using BFS like below?)
b) My implementation is right, but I made things cumbersome for Python so it needs more time to deal with the tree than to deal with brute force.
c) everything is OK - it is just how things should behave for a large n
My query function looks like this:
from collections import deque
def query(self,low,high):
result = []
q = deque([self.root]) # this is the root node in the tree
append_result = result.append
append_q = q.append
pop_left = q.popleft
while q:
node = pop_left() # look at the next node
if node.overlap(low,high): # some overlap?
append_result(node.interval)
if node.low != None and low <= node.get_low_max(): # en-q left node
append_q(node.low)
if node.high != None and node.get_high_min() <= high: # en-q right node
append_q(node.high)
I build the tree like this:
def build(self, intervals):
"""
Function which is recursively called to build the tree.
"""
if intervals is None:
return None
if len(intervals) > 2: # intervals is always sorted in increasing order
mid = len(intervals)//2
# split intervals into three parts:
# central element (median)
center = intervals[mid]
# left half (<= median)
new_low = intervals[:mid]
#right half (>= median)
new_high = intervals[mid+1:]
#compute max on the lower side (left):
max_low = max([n.get_high() for n in new_low])
#store min on the higher side (right):
min_high = new_high[0].get_low()
elif len(intervals) == 2:
center = intervals[1]
new_low = [intervals[0]]
new_high = None
max_low = intervals[0].get_high()
min_high = None
elif len(intervals) == 1:
center = intervals[0]
new_low = None
new_high = None
max_low = None
min_high = None
else:
raise Exception('The tree is not behaving as it should...')
return(Node(center, self.build(new_low),self.build(new_high),
max_low, min_high))
EDIT:
A node is represented like this:
class Node:
def __init__(self, interval, low, high, max_low, min_high):
self.interval = interval # pointer to corresponding interval object
self.low = low # pointer to node containing intervals to the left
self.high = high # pointer to node containing intervals to the right
self.max_low = max_low # maxiumum value on the left side
self.min_high = min_high # minimum value on the right side
All the nodes in a subtree can be obtained like this:
def subtree(current):
node_list = []
if current.low != None:
node_list += subtree(current.low)
node_list += [current]
if current.high != None:
node_list += subtree(current.high)
return node_list
p.s. note that by exploiting that there is so much overlap and that all intervals have comparable lenghts, I managed to implement a simple method based on sorting and bisection that completed in 80 s, but I would say this is over-fitting... Amusingly, by using asymptotic analysis, I found it should have app. the same runtime as using the tree...

If I correctly understand your problem, you are trying to speed up your process.
If it is that, try to create a real tree instead of manipulating lists.
Something that looks like :
class IntervalTreeNode():
def __init__(self, parent, min, max):
self.value = (min,max)
self.parent = parent
self.leftBranch = None
self.rightBranch= None
def insert(self, interval):
...
def asList(self):
""" return the list that is this node and all the subtree nodes """
left=[]
if (self.leftBranch != None):
left = self.leftBranch.asList()
right=[]
if (self.rightBranch != None):
left = self.rightBranch.asList()
return [self.value] + left + right
And then at start create an internalTreeNode and insert all yours intervals in.
This way, if you really need a list you can build a list each time you need a result and not each time you make a step in your recursive iteration using [x:] or [:x] as list manipulation is a costly operation in python. It is possible to work also using directly the nodes instead of a list that will greatly speed up the process as you just have to return a reference to the node instead of doing some list addition.

K-th order neighbors in graph - Python networkx

I have a directed graph in which I want to efficiently find a list of all K-th order neighbors of a node. K-th order neighbors are defined as all nodes which can be reached from the node in question in exactly K hops.
I looked at networkx and the only function relevant was neighbors. However, this just returns the order 1 neighbors. For higher order, we need to iterate to determine the full set. I believe there should be a more efficient way of accessing K-th order neighbors in networkx.
Is there a function which efficiently returns the K-th order neighbors, without incrementally building the set?
EDIT: In case there exist other graph libraries in Python which might be useful here, please do mention those.

You can use:
nx.single_source_shortest_path_length(G, node, cutoff=K)
where G is your graph object.

For NetworkX the best method is probably to build the set of neighbors at each k. You didn't post your code but it seems you probably already have done this:
import networkx as nx
def knbrs(G, start, k):
nbrs = set([start])
for l in range(k):
nbrs = set((nbr for n in nbrs for nbr in G[n]))
return nbrs
if __name__ == '__main__':
G = nx.gnp_random_graph(50,0.1,directed=True)
print(knbrs(G, 0, 3))

Yes,you can get a k-order ego_graph of a node
subgraph = nx.ego_graph(G,node,radius=k)
then neighbors are nodes of the subgraph
neighbors= list(subgraph.nodes())

I had a similar problem, except that I had a digraph, and I need to maintain the edge-attribute dictionary. This mutual-recursion solution keeps the edge-attribute dictionary if you need that.
def neighbors_n(G, root, n):
E = nx.DiGraph()
def n_tree(tree, n_remain):
neighbors_dict = G[tree]
for neighbor, relations in neighbors_dict.iteritems():
E.add_edge(tree, neighbor, rel=relations['rel'])
#you can use this map if you want to retain functional purity
#map(lambda neigh_rel: E.add_edge(tree, neigh_rel[0], rel=neigh_rel[1]['rel']), neighbors_dict.iteritems() )
neighbors = list(neighbors_dict.iterkeys())
n_forest(neighbors, n_remain= (n_remain - 1))
def n_forest(forest, n_remain):
if n_remain <= 0:
return
else:
map(lambda tree: n_tree(tree, n_remain=n_remain), forest)
n_forest( [root] , n)
return E

You solve your problem using modified BFS algorithm. When you're storing node in queue, store it's level (distance from root) as well. When you finish processing the node (all neighbours visited - node marked as black) you can add it to list of nodes of its level. Here is example based on this simple implementation:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from collections import defaultdict
from collections import deque
kth_step = defaultdict(list)
class BFS:
def __init__(self, node,edges, source):
self.node = node
self.edges = edges
self.source = source
self.color=['W' for i in range(0,node)] # W for White
self.graph =color=[[False for i in range(0,node)] for j in range(0,node)]
self.queue = deque()
# Start BFS algorithm
self.construct_graph()
self.bfs_traversal()
def construct_graph(self):
for u,v in self.edges:
self.graph[u][v], self.graph[v][u] = True, True
def bfs_traversal(self):
self.queue.append((self.source, 1))
self.color[self.source] = 'B' # B for Black
kth_step[0].append(self.source)
while len(self.queue):
u, level = self.queue.popleft()
if level > 5: # limit searching there
return
for v in range(0, self.node):
if self.graph[u][v] == True and self.color[v]=='W':
self.color[v]='B'
kth_step[level].append(v)
self.queue.append((v, level+1))
'''
0 -- 1---7
| |
| |
2----3---5---6
|
|
4
'''
node = 8 # 8 nodes from 0 to 7
edges =[(0,1),(1,7),(0,2),(1,3),(2,3),(3,5),(5,6),(2,4)] # bi-directional edge
source = 0 # set fist node (0) as source
bfs = BFS(node, edges, source)
for key, value in kth_step.items():
print key, value
Output:
$ python test.py
0 [0]
1 [1, 2]
2 [3, 7, 4]
3 [5]
4 [6]
I don't know networkx, neither I found ready to use algorithm in Graph Tool. I believe such a problem isn't common enough to have its own function. Also I think it would be overcomplicated, inefficient and redundant to store lists of k-th neighbours for any node in graph instance so such a function would probably have to iterate over nodes anyway.

As proposed previously, the following solution gives you all secondary neighbors (neighbors of neighbors) and lists all neighbors once (the solution is based on BFS):
{n: path for n, path in nx.single_source_shortest_path(G, 'a', cutoff=2).items() if len(path)==3}
Another solution which is slightly faster (6.68 µs ± 191 ns vs. 13.3 µs ± 32.1 ns, measured with timeit) includes that in undirected graphs the neighbor of a neighbor can be the source again:
def k_neighbors(G, source, cutoff):
neighbors = {}
neighbors[0] = {source}
for k in range(1, cutoff+1):
neighbors[k] = set()
for node in level[k-1]:
neighbors[k].update(set(G.neighbors(node)))
return neighbors
k_neighbors(B, 'a', 2) #dict keyed with level until `cutoff`, in this case 2
Both solutions give you the source itself as 0th-order neighbor.
So it depends on your context which one to prefer.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.