How to Search 100,000 Edges Fast? - python

I have a network DirectedGraph object G that contains around 2,0000 vertices and 120,000 edges in it.
I'd like to search the edge list and check which edge ends in the vertex 'deny'.(Oh the graph vertices are all an English word.)
I've just stupidly does as following, but it never stops.. I am waiting more than 10 mins from the very beginning.
How could I perform it fast?
for i in range(len(G.edges())):
if list(G.edges())[i][1] == 'deny':
print(list(G.edges())[i])

You can have dictionary of <end vertex, (start vertex, end)>. However, it depends on how frequent these operations would be performed.
If it's just one time operation then I would just go about looping my list otherwise implement the dictionary.

Related

How to generate all directed permutations of an undirected graph?

I am looking for a way to generate all possible directed graphs from an undirected template. For example, given this graph "template":
I want to generate all six of these directed versions:
In other words, for each edge in the template, choose LEFT, RIGHT, or BOTH direction for the resulting edge.
There is a huge number of outputs for even a small graph, because there are 3^E valid permutations (where E is the number of edges in the template graph), but many of them are duplicates (specifically, they are automorphic to another output). Take these two, for example:
I only need one.
I'm curious first: Is there is a term for this operation? This must be a formal and well-understood process already?
And second, is there a more efficient algorithm to produce this list? My current code (Python, NetworkX, though that's not important for the question) looks like this, which has two things I don't like:
I generate all permutations even if they are isomorphic to a previous graph
I check isomorphism at the end, so it adds additional computational cost
Results := Empty List
T := The Template (Undirected Graph)
For i in range(3^E):
Create an empty directed graph G
convert i to trinary
For each nth edge in T:
If the nth digit of i in trinary is 1:
Add the edge to G as (A, B)
If the nth digit of i in trinary is 2:
Add the edge to G as (B, A)
If the nth digit of i in trinary is 0:
Add the reversed AND forward edges to G
For every graph in Results:
If G is isomorphic to Results, STOP
Add G to Results

Finding Clique using graph and vertices

PYTHON ONLY!!
I have a graph
graph = [[0,1,1,1,0],
[1,0,0,0,0],
[0,1,0,1,1],
[1,0,1,0,1],
[0,0,1,1,0]]
The function clique(graph, vertices) should take an adjacency matrix
representation of a graph, a list of one or more vertices, and return a boolean True if the vertices create a clique
(every person is friends with every other person), otherwise return False.
`def clique(graph, vertices)`
I want to find out whether does a clique exists in the graph above
If yes the output should be True, otherwise False
eg. 'clique', (graph,[2,3,4]), True)]
Explanation needed thanks!
https://en.wikipedia.org/wiki/Clique_problem
Here you go, depending on what problem you ACTUALLY want to solve, here you have a starting point for finding an algorithm.
Is a graph a clique: Just check that all nodes are adjacent to each other.
Does it contain a clique? Always true if the graph is non-empty because a single vertex is already a clique of size one.
Does it contain a clique of size k? brute force it
Find a single maximal clique? Greedy algorithm possible as described in the link.
Find all maximal cliques? see wikipedia page for references (this is hard)

Algorithm/Approach to find destination node from source node of exactly k edges in undirected graph of billion nodes without cycle

Consider I have an adjacency list of billion of nodes structured using hash table arranged in the following manner:
key = source node
value = hash_table { node1, node2, node3}
The input values are from text file in the form of
from,to
1,2
1,5
1,11
... so on
eg.
key = '1'
value = {'2','5','11'}
means 1 is connected to nodes 2,5,11
I want to know an algorithm or approach to find destination node from source node of exactly k edges in an undirected graph of billion nodes without cycle
for eg. from node 1 I want to find node 50 only till depth 3 or till 3 edges.
My assumption the algorithm finds 1 - 2 - 60 - 50 which is the shortest path but how would the traversing be efficient using the above adjacency list structure?
I do not want to use Hadoop/Map Reduce.
I came up with naive solution as below in Python but is not efficient. Only thing is hash table searches key in O(1) so I could just search neighbours and their billion neighbours directly for the key. The following algo takes lot of time.
start with source node
use hash table search for finding key
go 1 level deeper with hash table of neighbor nodes and find their values for destination nodes until node found
Stop if node not found on k depth
&nbsp1
&nbsp&nbsp|
{2 &nbsp &nbsp&nbsp&nbsp &nbsp 5 &nbsp&nbsp&nbsp&nbsp&nbsp &nbsp&nbsp&nbsp 11}
&nbsp&nbsp | &nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp | &nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp |
{3,6,7} {nodes} {nodes} .... connected nodes
&nbsp |&nbsp | &nbsp| &nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp | &nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp |
{nodes} {nodes} {nodes} .... million more connected nodes.
Please suggest. The algorithm above implemented similar to BFS takes more than 3 hours to search for all the possible key value relationships. Can be it be reduced with other searching method?
As you've hinted, this will depend a lot on the data access characteristics of your system. If you were restricted to single-element accesses, then you'd be truly stuck, as trincot observes. However, if you can manage block accesses, then you have a chance of parallel operations.
However, I think that would be outside your control: the hash function owns the adjacency characteristics -- and, in fact, will likely "pessimize" (opposite of "optimize") that characteristic.
I do see one possible hope: use iteration instead of recursion, maintaining a list of nodes to visit. When you place a new node on the list, get its hash value. If you can organize the nodes clustered by location, you can perhaps do a block transfer, accessing several values in one read operation.

Ordering linestring direction algorithm

I want to build an algorithm in python to flip linestrings (arrays of coordinates) in a linestring collection which represent segments along a road, so that I can merge all coordinates into a single array where the coordinates are rising monotonic.
So my Segmentcollection looks something like this:
segmentCollection = [['1,1', '1,3', '2,3'],
['4,3', '2,3'],
['4,3', '7,10', '5,5']]
EDIT: SO the structure is a list of lists of 2D cartesian coordinate tuples ('1,1' for example is a point at x=1 and y=1, '7,10' is a point at x=7 and y=10, and so on). The whole problem is to merge all these lists to one list of coordinate tuples which are ordered in the sense of following a road in one direction...in fact these are segments which I get from a road network routing service,but I only get segments,where each segment is directed the way it is digitized in the database,not into the direction you have to drive. I would like to get a single polyline for the navigation route out of it.
So:
- I can assume, that all segments are in the right order
- I cannot assume that the Coordinates of each segment are in the right order
- Therefore I also cannot assume that the first coordinate of the first segment is the beginning
- And I also cannot assume that the last coordinate of the last segment is the end
- (EDIT) Even thought I Know,where the start and end point of my navigation request is located,these do not have to be identical with one of the coordinate tuples in these lists,because they only have to be somewhere near a routing graph element.
The algorithm should iterate through every segment, flip it if necessary, and append it then to the resulting array. For the first segment,the challenge is to find the starting point (the point which is NOT connected to the next segment). All other segments are then connected with one point to the last segment in the order (a directed graph).
I'd wonder if there isn't some kind of sorting data structure (sorting tree or anything) which does exactly that. Could you please give some ideas? After messing around a while with loops and array comparisons my brain is knocked out, and I just need a kick into the right direction in the true sense of the word.
If I understand correctly, you don't even need to sort things. I just translated your English text into Python:
def joinSegments( s ):
if s[0][0] == s[1][0] or s[0][0] == s[1][-1]:
s[0].reverse()
c = s[0][:]
for x in s[1:]:
if x[-1] == c[-1]:
x.reverse()
c += x
return c
It still contains duplicate points, but removing those should be straightforward.
def merge_seg(s):
index_i = 0
while index_i+1<len(s):
index_j=index_i+1
while index_j<len(s):
if c[index_i][-1] == c[index_j][0]:
c[index_i].extend(c[index_j][1:])
del c[index_j]
elif c[index_i][-1] == c[index_j][-1]:
c[index_i].extend(c[index_j].reverse()[1:])
del c[index_j]
else:
index_j+=1
index_i+=1
result = []
s.reverse()
for seg_index in range(len(s)-1):
result+=s[seg_index][:-1]#use [:-1] to delete the duplicate items
result+=s[-1]
return result
In inner while loop,every successive segment of s[index_i] is appended to s[index_i]
then index_i++ until every segments is processed.
therefore it is easy to proof that after these while loops, s[0][0] == s[1][-1], s[1][0] == s[2][-1], etc. so just reverse the list and put them together finally you will get your result.
Note: It is the most simple and straightford way, but not most time efficient.
for more algo see:http://en.wikipedia.org/wiki/Sorting_algorithm
You say that you can assume that all segments are in the right order, which means that independently of the coordinates order, your problem is basically to merge sorted arrays.
You would have to flip a segment if it's not defined in the right order, but this doesn't have a single impact on the main algorithm.
simply defind this reordering function:
def reorder(seg):
s1 = min(seg)
e1 = max(seg)
return (s1, e1)
and this comparison funciton
def cmp(seg1, seg2):
return cmp(reorder(seg1), reorder(seg2))
and you are all set, just run a typical merge algorithm:
http://en.wikipedia.org/wiki/Merge_algorithm
And in case, I didn't really understand your problem statement, here's another idea:
Use a segment tree which is a structure that is made exactly to store segments :)

Optimising model of social network evolution

I am writing a piece of code which models the evolution of a social network. The idea is that each person is assigned to a node and relationships between people (edges on the network) are given a weight of +1 or -1 depending on whether the relationship is friendly or unfriendly.
Using this simple model you can say that a triad of three people is either "balanced" or "unbalanced" depending on whether the product of the edges of the triad is positive or negative.
So finally what I am trying to do is implement an ising type model. I.e. Random edges are flipped and the new relationship is kept if the new network has more balanced triangels (a lower energy) than the network before the flip, if that is not the case then the new relationship is only kept with a certain probability.
Ok so finally onto my question: I have written the following code, however the dataset I have contains ~120k triads, as a result it will take 4 days to run!
Could anyone offer any tips on how I might optimise the code?
Thanks.
#Importing required librarys
try:
import matplotlib.pyplot as plt
except:
raise
import networkx as nx
import csv
import random
import math
def prod(iterable):
p= 1
for n in iterable:
p *= n
return p
def Sum(iterable):
p= 0
for n in iterable:
p += n[3]
return p
def CalcTriads(n):
firstgen=G.neighbors(n)
Edges=[]
Triads=[]
for i in firstgen:
Edges.append(G.edges(i))
for i in xrange(len(Edges)):
for j in range(len(Edges[i])):# For node n go through the list of edges (j) for the neighboring nodes (i)
if set([Edges[i][j][1]]).issubset(firstgen):# If the second node on the edge is also a neighbor of n (its in firstgen) then keep the edge.
t=[n,Edges[i][j][0],Edges[i][j][1]]
t.sort()
Triads.append(t)# Add found nodes to Triads.
new_Triads = []# Delete duplicate triads.
for elem in Triads:
if elem not in new_Triads:
new_Triads.append(elem)
Triads = new_Triads
for i in xrange(len(Triads)):# Go through list of all Triads finding the weights of their edges using G[node1][node2]. Multiply the three weights and append value to each triad.
a=G[Triads[i][0]][Triads[i][1]].values()
b=G[Triads[i][1]][Triads[i][2]].values()
c=G[Triads[i][2]][Triads[i][0]].values()
Q=prod(a+b+c)
Triads[i].append(Q)
return Triads
###### Import sorted edge data ######
li=[]
with open('Sorted Data.csv', 'rU') as f:
reader = csv.reader(f)
for row in reader:
li.append([float(row[0]),float(row[1]),float(row[2])])
G=nx.Graph()
G.add_weighted_edges_from(li)
for i in xrange(800000):
e = random.choice(li) # Choose random edge
TriNei=[]
a=CalcTriads(e[0]) # Find triads of first node in the chosen edge
for i in xrange(0,len(a)):
if set([e[1]]).issubset(a[i]): # Keep triads which contain the whole edge (i.e. both nodes on the edge)
TriNei.append(a[i])
preH=-Sum(TriNei) # Save the "energy" of all the triads of which the edge is a member
e[2]=-1*e[2]# Flip the weight of the random edge and create a new graph with the flipped edge
G.clear()
G.add_weighted_edges_from(li)
TriNei=[]
a=CalcTriads(e[0])
for i in xrange(0,len(a)):
if set([e[1]]).issubset(a[i]):
TriNei.append(a[i])
postH=-Sum(TriNei)# Calculate the post flip "energy".
if postH<preH:# If the post flip energy is lower then the pre flip energy keep the change
continue
elif random.random() < 0.92: # If the post flip energy is higher then only keep the change with some small probability. (0.92 is an approximate placeholder for exp(-DeltaH)/exp(1) at the moment)
e[2]=-1*e[2]
The following suggestions won't boost your performance that much because they are not on the algorithmic level, i.e. not very specific to your problem. However, they are generic suggestions for slight performance improvements:
Unless you are using Python 3, change
for i in range(800000):
to
for i in xrange(800000):
The latter one just iterates numbers from 0 to 800000, the first one creates a huge list of numbers and then iterates that list. Do something similar for the other loops using range.
Also, change
j=random.choice(range(len(li)))
e=li[j] # Choose random edge
to
e = random.choice(li)
and use e instead of li[j] subsequently. If you really need a index number, use random.randint(0, len(li)-1).
There are syntactic changes you can make to speed things up, such as replacing your Sum and Prod functions with the built-in equivalents sum(x[3] for x in iterable) and reduce(operator.mul, iterable) - it is generally faster to use builtin functions or generator expressions than explicit loops.
As far as I can tell the line:
if set([e[1]]).issubset(a[i]): # Keep triads which contain the whole edge (i.e. both nodes on the edge)
is testing if a float is in a list of floats. Replacing it with if e[1] in a[i]: will remove the overhead of creating two set objects for each comparison.
Incidentally, you do not need to loop through the index values of an array, if you are only going to use that index to access the elements. e.g. replace
for i in range(0,len(a)):
if set([e[1]]).issubset(a[i]): # Keep triads which contain the whole edge (i.e. both nodes on the edge)
TriNei.append(a[i])
with
for x in a:
if set([e[1]]).issubset(x): # Keep triads which contain the whole edge (i.e. both nodes on the edge)
TriNei.append(x)
However I suspect that changes like this will not make a big difference to the overall runtime. To do that you either need to use a different algorithm or switch to a faster language. You could try running it in pypy - for some cases it can be significantly faster than CPython. You could also try cython, which will compile your code to C and can sometimes give a big performance gain especially if you annotate your code with cython type information. I think the biggest improvement may come from changing the algorithm to one that does less work, but I don't have any suggestions for that.
BTW, why loop 800000 times? What is the significance of that number?
Also, please use meaningful names for your variables. Using single character names or shrtAbbrv does not speed the code up at all, and makes it very hard to follow what it is doing.
There are quite a few things you can improve here. Start by profiling your program using a tool like cProfile. This will tell you where most of the program's time is being spent and thus where optimization is likely to be most helpful. As a hint, you don't need to generate all the triads at every iteration of the program.
You also need to fix your indentation before you can expect a decent answer.
Regardless, this question might be better suited to Code Review.
I'm not sure I understand exactly what you are aiming for, but there are at least two changes that might help. You probably don't need to destroy and create the graph every time in the loop since all you are doing is flipping one edge weight sign. And the computation to find the triangles can be improved.
Here is some code that generates a complete graph with random weights, picks a random edge in a loop, finds the triads and flips the edge weight...
import random
import networkx as nx
# complete graph with random 1/-1 as weight
G=nx.complete_graph(5)
for u,v,d in G.edges(data=True):
d['weight']=random.randrange(-1,2,2) # -1 or 1
edges=G.edges()
for i in range(10):
u,v = random.choice(edges) # random edge
nbrs = set(G[u]) & set(G[v]) - set([u,v]) # nodes in traids
triads = [(u,v,n) for n in nbrs]
print "triads",triads
for u,v,w in triads:
print (u,v,G[u][v]['weight']),(u,w,G[u][w]['weight']),(v,w,G[v][w]['weight'])
G[u][v]['weight']*=-1

Categories