Cycles in a highly connected directed graph - networkx - python

I have a moderately sized directed graph of around 3,000 nodes and 260,000 edges that I have built in networkx. The network is mostly transitive: i.e., if a directs to b and b directs to c, then a directs to c. I am trying to use the simple_cycles algorithm from the networkx package to obtain a list of every cycle in the network (i.e., every violation of transitivity).
To do this I run
l = nx.simple_cycles(G)
cycle_list = list(l)
where G is the network.
I am running into the issue that the second line never runs to completion (I've left it running for 24 hours). When I apply the algorithm to a subset of 2,100 nodes of the original network, it takes around 4 seconds to run.
Any idea where the bottleneck is, and what I can do to fix it so it runs quickly?
Update: Creation method
import pandas as pd
import networkx as nx

df = pd.read_csv('epsilon_djordje.csv')
# Orient each edge by the sign of f.i.j; rows with f.i.j == 0 become a (0, 0) placeholder
edges = [(df['i'][x], df['j'][x]) if df['f.i.j'][x] > 0
         else (df['j'][x], df['i'][x]) if df['f.i.j'][x] < 0
         else (0, 0) for x in range(len(df))]
edges = list(set(edges))  # drop duplicate edges
edges.remove((0, 0))      # drop the placeholder
G = nx.DiGraph(edges)
As a reference:
df['i'] is a column of strings (which correspond to the nodes).
df['j'] is a column of strings (which correspond to the nodes).
df['f.i.j'] is a column of floats (which determine the direction of the edge between two nodes).
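For reference, the same edge list can be built without the row-by-row comprehension; a minimal vectorized sketch, assuming the same CSV and column names as above:
import pandas as pd
import networkx as nx

df = pd.read_csv('epsilon_djordje.csv')
# Orient each edge by the sign of f.i.j; rows with f.i.j == 0 are dropped up front
pos = df[df['f.i.j'] > 0]
neg = df[df['f.i.j'] < 0]
edges = list(zip(pos['i'], pos['j'])) + list(zip(neg['j'], neg['i']))
G = nx.DiGraph(edges)  # DiGraph ignores duplicate edges, so no manual de-duplication is needed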

Related

Creating a directed scale-free graph with row-stochastic adjacency matrix using Networkx

As part of my dissertation in the field of behavioural economics, I started to work with social networks and opinion dynamics.
For a simulation-based study, I currently require a directed scale-free network featuring a row-stochastic adjacency matrix in order to perform calculations with the Degroot opinion dynamics model.
The aim is to generate a directed scale-free network in which the nodes with the highest out-degree affect many other agents within their network hub, but are influenced themselves only a bit (ingoing weights are still > 0 since I need a positive sum of the respective row in the adjacency matrix).
You can think of the network as a stylized Twitter network, where a few strongly connected nodes affect many other nodes but are not themselves influenced by others.
The problem was that after normalization the network was no longer recognised as directed. My code follows.
In a first step, I used the Networkx package to generate a scale-free graph and converted the graph object into an adjacency matrix:
G = nx.scale_free_graph(100)
nx.is_directed(G)
Output: True
Subsequently, I normalised the underlying adjacency matrix (i.e., made it row-stochastic) and converted it back into a graph.
from sklearn.preprocessing import normalize  # normalize here comes from scikit-learn

A = nx.to_numpy_array(G)
A_normalized = normalize(A, axis=1, norm='l1')  # make every row sum to 1
G_new = nx.from_numpy_matrix(A_normalized)
nx.is_directed(G_new)
Output: False
Can someone explain to me why this is the case or what I can change to make my normalised network count as directed again?
The directed scale-free network is a MultiDiGraph. All you need to do is use the create_using parameter when creating the new graph from the numpy array:
import networkx as nx
from sklearn.preprocessing import normalize

G = nx.scale_free_graph(100)
A = nx.to_numpy_array(G)
A_normalized = normalize(A, axis=1, norm='l1')
G_new = nx.from_numpy_array(A_normalized, create_using=nx.MultiDiGraph)
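Repeating the check from the question now reports a directed graph:
nx.is_directed(G_new)
Output: True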

Python NumPy vectorization

I'm trying to code what is known as the List Right Heuristic for the unweighted vertex cover problem. The background is as follows:
Vertex Cover Problem: In the vertex cover problem, we are given an undirected graph G = (V, E) where V is the set of vertices and E is the set of Edges. We need to find the smallest set V' which is a subset of V such that V' covers G. A set V' is said to cover a graph G if all the edges in the graph have at least one vertex in V'.
List Right Heuristic: The algorithm is very simple. Given a list of vertices V = [v1, v2, ... vn], where n is the number of vertices in G, vi is said to be a right neighbor of vj if i > j and vi and vj are connected by an edge in G. We initialize a cover C = {} (empty set) and scan V from right to left. At any point, say the current vertex being scanned is u. If u has at least one right neighbor not in C, then u is added to C. The entire V is scanned just once.
I'm solving this for multiple graphs (with same vertices but different edges) at once.
I coded the List Right Heuristic in Python. I was able to vectorize it to solve multiple graphs at once, but I was unable to vectorize the original for loop. I'm representing the graph using an adjacency matrix. I was wondering if it can be further vectorized. Here's my code:
import numpy as np
import numpy.matlib

def list_right_heuristic(population: np.ndarray, adj_matrix: np.ndarray):
    # One copy of the adjacency matrix per graph in the population
    adj_matrices = np.matlib.repmat(adj_matrix, population.shape[0], 1).reshape((population.shape[0], *adj_matrix.shape))
    for i in range(population.shape[0]):
        # Remove covered vertices from the graph: delete the corresponding edges
        adj_matrices[i, np.outer(population[i], population[i]).astype(bool)] = 0
    vertex_covers = np.zeros(shape=population.shape, dtype=population.dtype)
    for index in range(population.shape[-1] - 1, -1, -1):
        # Get the number of intersecting elements (per row) in right neighbors and vertex_covers
        inclusion_rows = np.sum(((1 - vertex_covers) * adj_matrices[..., index])[..., index + 1:], axis=-1).astype(bool)
        # Only add vertices to the cover for rows with at least one right neighbor not in the cover
        vertex_covers[inclusion_rows, index] = 1
    return vertex_covers
I have p graphs that I'm trying to solve simultaneously, where p = population.shape[0]. Each graph has the same vertices but different edges. The population array is a 2D array where each row indicates which vertices of the graph G are already in the cover. I'm only trying to find the vertices which are not in the cover, so I set all rows and columns of covered vertices to 0, i.e., I delete the corresponding edges. The heuristic should then, in theory, return only vertices not in the cover.
So in the first for loop, I just set the corresponding rows and columns in the adjacency matrix to 0 (all elements in those rows and columns become zero). Next, I go through the 2D array of vertices from right to left and find the number of right neighbors in each row not in vertex_covers. For this I first find the vertices not in the cover (1 - vertex_covers) and then multiply that with the corresponding columns in adj_matrices (or rows, since the adjacency matrix is symmetric) to get the neighbors of the vertex being scanned. Then I sum all elements to the right of it. If this value is greater than 0, there is at least one right neighbor not in vertex_covers.
Am I doing this correctly for one?
And is there any way to vectorize the second for loop (or the first, for that matter) or speed up the code in general? I'm calling this function thousands of times from other code, on large graphs (1000+ vertices). Any help would be appreciated.
You can use np.einsum to perform many complex operations between indices. In your case, the first loop can be performed this way:
adj_matrices[np.einsum('ij, ik->ijk', population, population).astype(bool)] = 0
It took me some time to understand how einsum works. I found this SO question very helpful.
BTW, your code gave me the following syntax error:
SyntaxError: can use starred expression only as assignment target
and I had to re-write the first line of the function as:
adj_matrices = np.matlib.repmat(adj_matrix, population.shape[0],
                                1).reshape((population.shape[0],) + adj_matrix.shape)
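For what it's worth, the same masking can also be done with plain broadcasting instead of np.einsum; a small sketch under the same shapes as above (population is 2D, adj_matrices is 3D):
# Outer product of each population row with itself, built as one 3D boolean mask
mask = population[:, :, None].astype(bool) & population[:, None, :].astype(bool)
adj_matrices[mask] = 0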

Keeping scale free graph's degree distribution after perturbation - python

Given a scale-free graph G (a graph whose degree distribution follows a power law), and the following procedure:
from random import randint

for i in range(C):
    coin = randint(0, 1)
    if coin == 0:
        delete_random_edges(G)
    else:
        add_random_edge(G)
(C is a constant)
So, when C is large, the degree distribution after the procedure will look more like that of G(n,p). I am interested in preserving the power-law distribution, i.e., I want the graph to remain scale-free after this procedure, even for large C.
My idea is to write the procedures delete_random_edges and add_random_edge in a way that gives edges connected to high-degree nodes a small probability of being deleted (and, when adding a new edge, makes it more likely to attach to a high-degree node).
I use networkx to represent the graph, and all I found are procedures that delete or add a specific edge. Any idea how I can implement the above?
Here are 2 algorithms:
Algorithm 1
This algorithm does not preserve the degree exactly, rather it preserves the expected degree.
Save each node's initial degree. Then delete edges at random. Whenever you create an edge, do so by randomly choosing two nodes, each with probability proportional to the initial degree of those nodes.
After a long period of time, the expected degree of each node 'u' is its initial degree (but it might be a bit higher or lower).
Basically, this will create what is called a Chung-Lu random graph. Networkx has a built-in algorithm for creating them.
Note - this will allow the degree distribution to vary.
Algorithm 1a
Here is the efficient networkx implementation, skipping over the edge deletion and addition and going straight to the final result (assuming a networkx graph G):
degree_list = [d for n, d in G.degree()]  # G.degree().values() in networkx 1.x
H = nx.expected_degree_graph(degree_list)
Here's the documentation
Algorithm 2
This algorithm preserves the degrees exactly.
Choose a set of edges and break them. Create a list in which each node appears once for every broken edge it was in. Shuffle this list. Create new edges between nodes that appear next to each other in this list.
Check to make sure you never join a node to itself or to a node which is already a neighbor. If this would occur you'll want to think of a custom way to avoid it. One option is to simply reshuffle the list. Another is to set those nodes aside and include them in the list you create next time you do this.
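A minimal sketch of Algorithm 2 using the simple "reshuffle on collision" option (the function name and the choice of how many edges to break are illustrative):
import random

def rewire_preserving_degrees(G, num_edges):
    # Choose a set of edges and break them
    broken = random.sample(list(G.edges()), num_edges)
    G.remove_edges_from(broken)
    # Each node appears once per broken edge it was in
    stubs = [n for edge in broken for n in edge]
    # Shuffle and pair adjacent entries; reshuffle on self-loops,
    # existing edges, or duplicate pairs (may retry many times on unlucky draws)
    while True:
        random.shuffle(stubs)
        pairs = list(zip(stubs[::2], stubs[1::2]))
        if (all(u != v and not G.has_edge(u, v) for u, v in pairs)
                and len({frozenset(p) for p in pairs}) == len(pairs)):
            break
    G.add_edges_from(pairs)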
edit
There is a built-in networkx command double_edge_swap to swap two edges at a time (documentation).
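For example (this assumes an undirected graph, which double_edge_swap requires; it preserves the degree sequence exactly):
nx.double_edge_swap(G, nswap=1000, max_tries=10000)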
Although you have already accepted the answer from @abdallah-sobehy, meaning that it works, I would suggest a simpler approach, in case it helps you or anybody around.
What you are trying to do is sometimes called preferential attachment (well, at least when you add nodes), and for that there is a random model developed quite some time ago; see the Barabasi-Albert model, which leads to a power-law degree distribution P(k) ~ k^-3.
Basically you have to add edges with probability equal to the degree of the node divided by the sum of the degrees of all the nodes. You can use scipy.stats to define the probability distribution with code like this:
import scipy.stats as stats
nodes = list(Gx.nodes())
sum_degrees = sum(dict(Gx.degree()).values())
# Probability of picking each node, proportional to its degree
p = [Gx.degree(n) / float(sum_degrees) for n in nodes]
custm = stats.rv_discrete(name='custm', values=(nodes, p))
Then you just pick 2 nodes following that distribution, and those are the 2 nodes you add an edge to,
custm.rvs(size=2)
As for deleting edges, I haven't tried that myself. But I guess you could use something like this,
sum_inv_degrees = sum(1.0 / d for d in dict(Gx.degree()).values())
p = [1.0 / (Gx.degree(n) * sum_inv_degrees) for n in nodes]
although honestly I am not completely sure; it is no longer the random model that I link to above...
Hope it helps anyway.
UPDATE after comments
Indeed, by using this method for adding edges to an existing graph, you could get 2 undesired outcomes:
duplicated links
self links
You could remove those, although it will make the results deviate from the expected distribution.
Anyhow, you should take into account that you are already deviating from the preferential attachment model, since the algorithm studied by Barabasi and Albert works by adding new nodes and links to the existing graph:
The network begins with an initial connected network of m_0 nodes.
New nodes are added to the network one at a time. Each new node is connected to m ≤ m_0 existing nodes with a probability that is proportional to the number
...
(see here)
If you want to get an exact distribution (instead of growing an existing network and keeping its properties), you're probably better off with the answer from @joel.
Hope it helps.
I am not sure to what extent this will preserve the scale-free property, but this can be a way to implement your idea:
In order to add an edge you need to specify 2 nodes in networkx. So, you can choose one node with a probability proportional to its degree, and the other node uniformly (without any preference). Choosing a highly connected node can be achieved as follows:
For a graph G whose nodes are [0, 1, 2, ..., n]:
1) Create a list of floats (limits) between 0 and 1 that specifies, for each node, a probability of being chosen proportional to its degree. For example: limits[1] - limits[0] is the probability of choosing node 0, limits[2] - limits[1] is the probability of choosing node 1, etc.
import numpy as np

# limits is a list of floats between 0 and 1 which defines
# the probability of choosing a certain node depending on its degree
limits = [0.0]
# store the total number of edges of the graph; the sum of all degrees is 2*num_edges
num_edges = G.number_of_edges()
# iterate over the nodes to calculate limits depending on degree
for i in G:
    limits.append(G.degree(i) / (2.0 * num_edges) + limits[-1])
2) Randomly generate a number between 0 and 1, then compare it to the limits and choose the node accordingly:
rnd = np.random.random()
# compare the random number to the limits and choose the node accordingly
for j in range(len(limits) - 1):
    if limits[j] <= rnd < limits[j + 1]:
        chosen_node = j  # note: G.node[j] would return the attribute dict, not the node
        break
3) Choose another node uniformly, by generating a random integer in [0, n].
4) Add an edge between the two chosen nodes.
5) Similarly, for deleting an edge, you can choose a node with probability proportional to 1/degree instead of degree, then uniformly delete any of its edges. A sketch of these last steps follows.
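A possible sketch of steps 3 to 5, assuming Python 3.6+ for random.choices (chosen_node comes from step 2 above; the other names are illustrative):
import random

# Step 3: choose the second node uniformly
other_node = random.choice(list(G.nodes()))
# Step 4: add an edge between the two chosen nodes
G.add_edge(chosen_node, other_node)

# Step 5: for deletion, choose a node with probability proportional to 1/degree,
# then uniformly delete one of its incident edges
candidates = [n for n in G if G.degree(n) > 0]
weights = [1.0 / G.degree(n) for n in candidates]
victim = random.choices(candidates, weights=weights, k=1)[0]
u, v = random.choice(list(G.edges(victim)))
G.remove_edge(u, v)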
It would be interesting to know whether this approach preserves the scale-free property, and at which C the property is lost, so let us know if it worked or not.
EDIT: As suggested by @joel, the selection of the node to add an edge to should be proportional to degree rather than degree^2. I have edited step 1 accordingly.
EDIT2: This might help you judge whether the scale-free graph lost its property after edge additions and removals: simply compute the preferential attachment score before and after the changes. You can find the documentation here.

APGL (Another Python Graph Library) - Set subgraph size after a Breadth First Search

I've been using the apgl 0.8.1 library to analyse a massive network (40 million nodes). I tried apgl because it works fine with scipy sparse matrices, and I can now load a sparse matrix into memory to perform analyses. I'm interested in obtaining a subgraph of a desired size from the whole network after a breadth-first search.
I'm reading an adjacency list into pandas to build a sparse matrix. Consider this sample network (Node, Node, Weight), called network:
1 5 1
5 1 1
1 2 1
5 6 1
6 7 1
7 5 1
5 2 1
2 3 1
3 4 1
4 2 1
3 8 1
9 10 1
1 11 1
11 12 1
12 13 1
13 1 1
5 14 1
This is the sample code I'm using:
# Import modules
import pandas as pd
import numpy as np
import scipy as sp
from apgl.graph import SparseGraph
# Load network
df = pd.read_csv(network, sep=r'\s+', header=None, names=['User1', 'User2', 'W'])
# Convert the numpy array from pandas into an NxN square matrix
# Read numpy array from pandas
arr = df.values
# Set matrix shape
shape = tuple(arr.max(axis=0)[:2] + 1)
# Build a sparse matrix
matrix = sp.sparse.csr_matrix((arr[:, 2], (arr[:, 0], arr[:, 1])),
                              shape=shape,
                              dtype=arr.dtype)
# Set the total number of nodes
numVertices = shape[0]
# Initialize graph
graph = SparseGraph(numVertices, undirected=False, W=matrix, frmt='csr')
# Perform BFS starting from node 5 --> set output to np.array
startingnode = 5
bfs = np.array(graph.breadthFirstSearch(startingnode))
# Return subgraph from a list of nodes
# Set limit
limit = 5
subgraph = graph.subgraph(bfs[:limit])
which returns:
bfs = [ 5 1 2 6 14 11 3 7 12 4 8 13]
subgraph = SparseGraph: vertices 5, edges 6, directed, vertex storage GeneralVertexList, edge storage <class 'scipy.sparse.csr.csr_matrix'>
So I set a limit of 5 nodes for the resulting subgraph. But the nodes are simply taken from the first to the fifth of the BFS order, regardless of the shape of the search. The breadth-first search explores neighbours of neighbours and so on, and I would like a limit that still completes the whole neighbour level of the last node chosen. In the example, the subgraph contains the first five nodes of the BFS array, so:
subgraph = [5 1 2 6 14]
but I would also like to include node 7 (which completes the neighbour level started from node 5) and node 3 (which completes the level of node 2). So the resulting array of nodes should be:
subgraph = [5 1 2 3 6 7 14]
Any help would be appreciated.
EDIT:
What I would like to do is find a subgraph of the entire graph, starting from a random node and running BFS until the subgraph reaches various sizes, e.g. 5 million, 10 million and 20 million nodes. I would like to complete each level of neighbours before stopping, so it doesn't matter whether the result has 5 million nodes or 5 million and 100, if the last 100 nodes are needed to complete a level of all neighbours of the last node found in the search.
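Not apgl-specific, but a minimal sketch of a BFS that only stops at level boundaries; neighbours() is a stand-in for however adjacency is read (with the csr matrix above, row u's neighbours are matrix.indices[matrix.indptr[u]:matrix.indptr[u+1]]), and min_size is the target subgraph size:
def bfs_whole_levels(neighbours, start, min_size):
    visited = {start}
    frontier = [start]
    while frontier and len(visited) < min_size:
        next_frontier = []
        for u in frontier:
            for v in neighbours(u):
                if v not in visited:
                    visited.add(v)
                    next_frontier.append(v)
        # the whole frontier is processed before the size check,
        # so the search never stops in the middle of a level
        frontier = next_frontier
    return visited  # may exceed min_size by at most one completed level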
BFS and DFS operate over graphs which may be acyclic (i.e. trees) or contain cycles.
During their operation, both algorithms may encounter back-edges. That is, edges that if traversed would bring the algorithm to consider nodes already visited.
These edges and the nodes at their ends are not (normally) reported in the output of BFS or DFS.
Networkx (https://networkx.github.io/) contains dfs_labeled_edges which returns an iterator with a characteristic for each edge.
As far as your code is concerned:
limit = 5
subgraph = graph.subgraph(bfs[:limit])
This is not going to search for all BFS subgraphs of 'length' 5. Because of bfs[:5], it will always look at the first 5 entries of the BFS output.
If you are looking for cycles, perhaps you can use a different algorithm (for example, this one, again from Networkx), or extract your subgraph from the original network and then use DFS to label its edges, or enumerate all of its simple paths and work on those.
Hope this helps.
Supplementary Information:
(I am also taking this older question of yours into account here: https://stackoverflow.com/questions/29169022/scipy-depth-breadth-first-search-tree-size-limit)
You want to extract a proper subgraph U(W,C) from your main graph G(V,E) with the order of U being much smaller than the order of G (|U|<<|G|) and furthermore with U being the BFS of G starting on some node v_i of G(V,E).
There are two ways you can do this:
Write your own BFS, where you can add counters for the depth of traversal and the number of nodes traversed, and use them to interrupt the algorithm wherever you like. Due to the extremely large number of nodes that you have, you should look into the iterative rather than the recursive version of the algorithm. For more information, please see: Way to go from recursion to iteration and, to some extent, http://en.wikipedia.org/wiki/Depth-limited_search . This approach will be more efficient because the BFS will not have to go through the whole graph.
Truncate the output of an existing BFS and use the remaining nodes as your starting points for the next step.
In either case, your algorithm will contain one more iteration step and will end up looking something like this (here for option #2):
#Given graph G(V,E) and NNodeLimit (a natural number)
#Produce a set Q of BFS proper subgraphs bearing the characteristics of U.
Q = []
nextBFSNode = {0}
while nextBFSNode:
    #Pop a starting node
    startingPoint = nextBFSNode.pop()
    #Build a BFS ordering starting from that node
    q = BFS(G, startingPoint)
    #Truncate its output and save the induced subgraph to the list
    Q.append(subgraph(G, q[:NNodeLimit]))
    #Add the remaining nodes as future starting points
    nextBFSNode.update(q[NNodeLimit:])
There will of course be considerable overlap between different trees.
Hope this helps.

Creating fixed set of nodes using networkx in python

I have a problem concerning graph diagrams. I have 30 nodes (points). I want to construct an adjacency matrix in such a way that each set of ten nodes sits at one vertex of a triangle. So let's say a group of 10 nodes is at each of the vertices A, B and C of a triangle ABC.
Two of the vertex sets should have only 10 edges (each node within the cluster connected to the next, forming a ring). Let's say the groups at A and B have 10 edges each. The third vertex set should have 11 edges (10 forming the ring, plus one node connected to two nodes, so 11 edges in that group). Let's say the one at C has 11 edges in it.
All three clusters would have one edge between them to form a triangle: that is, connect the group at A with the group at B with one edge, B with C with one edge, and C with A with one edge.
Later on I would add one more edge between B and C, represented as a dotted line in the attached figure. The points at a vertex can be in a circle or any other formation, as long as they represent a group.
How do I create an adjacency matrix for such a thing? I actually know how to create the adjacency matrix itself, as it is just a binary symmetric matrix (undirected graph). The problem is that when I try to plot that adjacency matrix, a node from one group is pulled closer to the group it is connected to. So let's say I connect one node at vertex A with one node at vertex B by an edge; this edge depicts the side AB of the triangle. But when I draw it using networkx, those two connected nodes from the two different groups eventually come closer and look like part of one group. How do I keep them as separate groups?
Please note I am making use of the networkx library of Python, which helps plot the adjacency matrix.
EDIT:
Code I am trying to use, inspired by the answer below:
import networkx as nx

G = nx.Graph()
# Creating three separate groups of nodes (10 nodes each)
node_clusters = [range(1, 11), range(11, 21), range(21, 31)]
# Adding edges between each set of nodes in each group
for x in node_clusters:
    for y in x:
        if y != x[-1]:
            G.add_edge(y, y + 1, len=2)
        else:
            G.add_edge(y, x[0], len=2)
# Adding three inter-group edges separately
for x in range(len(node_clusters)):
    if x < 2:
        G.add_edge(node_clusters[x][-1], node_clusters[x + 1][0], len=8)
    else:
        G.add_edge(node_clusters[x][-1], node_clusters[0][0], len=8)
nx.draw_graphviz(G, prog='neato')
Gives the following error:
--> 260 '(not available for Python3)')
261 if root is not None:
262 args+="-Groot=%s"%root
ImportError: ('requires pygraphviz ', 'http://networkx.lanl.gov/pygraphviz ', '(not available for Python3)')
My Python version is not 3, it's 2, and I am using the Anaconda distribution.
EDIT2:
I used Marius's code, but instead plotted with the following:
graph_pos = nx.spring_layout(G, k=0.20, iterations=50)
nx.draw_networkx(G, graph_pos)
It has completely destroyed the whole graph, and shows this:
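Note that spring_layout ignores the len attribute the answer below relies on. If pygraphviz is unavailable, a possible workaround in newer networkx versions is the pydot-backed graphviz layout (an assumption: this requires the pydot package to be installed):
graph_pos = nx.nx_pydot.graphviz_layout(G, prog='neato')
nx.draw_networkx(G, graph_pos)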
I was able to get something going fairly quickly just by hacking away at this. All you need to do is put together tuples representing each edge; you can also set some arbitrary lengths on the edges to get a decent approximation of your desired layout:
import networkx
import string

all_nodes = string.ascii_letters[:30]
a_nodes = all_nodes[:10]
b_nodes = all_nodes[10:20]
c_nodes = all_nodes[20:]
all_edges = []
for node_set in [a_nodes, b_nodes, c_nodes]:
    # Link each node to the next
    for i, node in enumerate(node_set[:-1]):
        all_edges.append((node, node_set[i + 1], 2))
    # Finish off the circle
    all_edges.append((node_set[0], node_set[-1], 2))
joins = [(a_nodes[0], b_nodes[0], 8), (b_nodes[-1], c_nodes[0], 8), (c_nodes[-1], a_nodes[-1], 8)]
all_edges += joins
# One extra edge for C:
all_edges.append((c_nodes[0], c_nodes[5], 5))
G = networkx.Graph()
for edge in all_edges:
    G.add_edge(edge[0], edge[1], len=edge[2])
networkx.draw_graphviz(G, prog='neato')
Try something like networkx.to_numpy_matrix(G) if you then want to export as an adjacency matrix.
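For instance, passing an explicit nodelist keeps the three clusters in contiguous blocks of the matrix (a sketch reusing the variables above):
# Rows/columns ordered a_nodes first, then b_nodes, then c_nodes
A = networkx.to_numpy_matrix(G, nodelist=list(all_nodes))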
