I am reading graphs such as http://www.dis.uniroma1.it/challenge9/data/rome/rome99.gr from http://www.dis.uniroma1.it/challenge9/download.shtml in Python. For example, using this code:
#!/usr/bin/python
from igraph import *
fname = "rome99.gr"
g = Graph.Read_DIMACS(fname, directed=True )
(I need to change the line "p sp 3353 8870" to "p max 3353 8870" to get this to work with igraph.)
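For reference, here is a small hedged sketch of how that header line could be patched programmatically so the original file stays untouched (my own convenience snippet, not part of the original question; the output filename rome99_max.gr is made up):

# Patch the DIMACS problem line so igraph's Read_DIMACS accepts the file.
with open("rome99.gr") as f:
    data = f.read().replace("p sp 3353 8870", "p max 3353 8870")
with open("rome99_max.gr", "w") as f:   # rome99_max.gr is a hypothetical name
    f.write(data)

g = Graph.Read_DIMACS("rome99_max.gr", directed=True)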
I would like to convert the graph to one where all nodes have outdegree 1 (except for the extra zero-weight edges we are allowed to add) but still preserve all shortest paths. That is, a path between two nodes should be a shortest path in the original graph if and only if the corresponding path is a shortest path in the converted graph. I will explain this a little more after an example.
One way to do this, I was thinking, is to replace each node v by a little linear subgraph with v.degree(mode=OUT) nodes. Within the subgraph the nodes are connected in sequence by zero-weight edges. We then connect nodes in the subgraph to the first node of the other little subgraphs we have created.
I don't mind using igraph or networkx for this task but I am stuck with the syntax of how to do it.
For example, if we start with graph G:
I would like to convert it to graph H:
As the second graph has more nodes than the first, we need to define what we mean by it having the same shortest paths as the first graph. I only consider paths that start and end either at nodes labelled with a plain letter or at nodes labelled like X1. In other words, in this example a path can't start or end at A2 or B2. We also merge all versions of a node when considering a path, so a path A1->A2->D in H is regarded as the same as A->D in G.
This is how far I have got. First I add the zero-weight edges to the new graph:
h = Graph(g.ecount(), directed=True)
#Connect the nodes with zero weight edges
gtoh = [0] * g.vcount()
i = 0
for v in g.vs:
    gtoh[v.index] = i
    if v.degree(mode=OUT) > 1:
        for j in xrange(v.degree(mode=OUT) - 1):
            h.add_edge(i, i + 1, weight=0)
            i = i + 1
    i = i + 1
Then I add the main edges
#Now connect the nodes to the relevant "head" nodes.
for v in g.vs:
    h_v_index = gtoh[v.index]
    i = 0
    for neighbour in g.neighbors(v, mode=OUT):
        h.add_edge(gtoh[v.index] + i, gtoh[neighbour],
                   weight=g.es[g.get_eid(v.index, neighbour)]["weight"])
        i = i + 1
Is there a nicer/better way of doing this? I feel there must be.
The following code should work in igraph and Python 2.x; basically it does what you proposed: it creates a "linear subgraph" for every single node in the graph, and connects exactly one outgoing edge to each node in the linear subgraph corresponding to the old node.
#!/usr/bin/env python

from igraph import Graph
from itertools import izip

def pairs(l):
    """Given a list l, returns an iterable that yields pairs of the form
    (l[i], l[i+1]) for all possible consecutive pairs of items in l"""
    return izip(l, l[1:])

def convert(g):
    # Get the old vertex names from g
    if "name" in g.vertex_attributes():
        old_names = map(str, g.vs["name"])
    else:
        old_names = map(str, xrange(g.vcount()))

    # Get the outdegree vector of the old graph
    outdegs = g.outdegree()

    # Create a mapping from old node IDs to the ID of the first node in
    # the linear subgraph corresponding to the old node in the new graph
    new_node_id = 0
    old_to_new = []
    new_names = []
    for old_node_id in xrange(g.vcount()):
        old_to_new.append(new_node_id)
        new_node_id += outdegs[old_node_id]
        old_name = old_names[old_node_id]
        if outdegs[old_node_id] <= 1:
            new_names.append(old_name)
        else:
            for i in xrange(1, outdegs[old_node_id] + 1):
                new_names.append(old_name + "." + str(i))

    # Add a sentinel element to old_to_new just to make our job easier
    old_to_new.append(new_node_id)

    # Create the edge list of the new graph and the weights of the new
    # edges
    new_edgelist = []
    new_weights = []

    # 1) Create the linear subgraphs
    for new_node_id, next_new_node_id in pairs(old_to_new):
        for source, target in pairs(range(new_node_id, next_new_node_id)):
            new_edgelist.append((source, target))
            new_weights.append(0)

    # 2) Create the new edges based on the old ones
    for old_node_id in xrange(g.vcount()):
        new_node_id = old_to_new[old_node_id]
        for edge_id in g.incident(old_node_id, mode="out"):
            neighbor = g.es[edge_id].target
            new_edgelist.append((new_node_id, old_to_new[neighbor]))
            new_node_id += 1
            new_weights.append(g.es[edge_id]["weight"])

    # Return the graph
    vertex_attrs = {"name": new_names}
    edge_attrs = {"weight": new_weights}
    return Graph(new_edgelist, directed=True, vertex_attrs=vertex_attrs,
                 edge_attrs=edge_attrs)
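For completeness, a small hedged usage sketch on a hand-built graph (my own illustration, not from the original answer; the vertex names A/B/C and the weights are made up, and every node is given outdegree at least 1):

# Tiny weighted directed graph: A -> B, A -> C, B -> C, C -> A
g = Graph(edges=[(0, 1), (0, 2), (1, 2), (2, 0)], directed=True)
g.vs["name"] = ["A", "B", "C"]
g.es["weight"] = [3, 1, 1, 2]

h = convert(g)
print(h.vs["name"])        # expected something like ['A.1', 'A.2', 'B', 'C']
print(h.get_edgelist())    # includes the zero-weight edge (0, 1) inside A's subgraph
print(h.es["weight"])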
Related
I am trying to implement the Chinese Whispers algorithm, but I cannot figure out the issue with the code below; the result I want to get is like in the picture below.
import networkx as nx
from random import shuffle as shuffle

# build nodes and edge lists
nodes = [
    (1, {'attr1': 1}),
    (2, {'attr1': 1})
]
edges = [
    (1, 2, {'weight': 0.732})
]

# initialize the graph
G = nx.Graph()

# Add nodes
G.add_nodes_from(nodes)

# CW needs an arbitrary, unique class for each node before initialisation
# Here I use the ID of the node since I know it's unique
# You could use a random number or a counter or anything really
for n, v in enumerate(nodes):
    G.node[v[0]]["class"] = v[1]["attr1"]

# add edges
G.add_edges_from(edges)

# run Chinese Whispers
# I default to 10 iterations. This number is usually low.
# After a certain number (individual to the data set) no further clustering occurs
iterations = 10
for z in range(0, iterations):
    gn = G.nodes()
    # I randomize the nodes to give me an arbitrary start point
    shuffle(gn)
    for node in gn:
        neighs = G[node]
        classes = {}
        # do an inventory of the given node's neighbours and edge weights
        for ne in neighs:
            if isinstance(ne, int):
                if G.node[ne]['class'] in classes:
                    classes[G.node[ne]['class']] += G[node][ne]['weight']
                else:
                    classes[G.node[ne]['class']] = G[node][ne]['weight']
        # find the class with the highest edge weight sum
        max = 0
        maxclass = 0
        for c in classes:
            if classes[c] > max:
                max = classes[c]
                maxclass = c
        # set the class of target node to the winning local class
        G.node[node]['class'] = maxclass
I want to produce the output shown in the picture referenced above.
As noted in the answer by @Shridhar R Kulkarni, you want to use G.nodes when referencing nodes for updating/adding attributes.
Another problem with the script is that for shuffling, you want only the list of node identifiers:
gn = list(G.nodes()) # use list() to allow shuffling
shuffle(gn)
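Putting both fixes together, a minimal hedged sketch of the corrected initialisation and shuffle (assuming networkx 2.x, where G.node was replaced by the G.nodes view):

# set the initial class via G.nodes (networkx 2.x API)
for n, v in enumerate(nodes):
    G.nodes[v[0]]["class"] = v[1]["attr1"]

gn = list(G.nodes())   # a plain list copy that shuffle() can permute in place
shuffle(gn)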
If you are interested in just the computation, rather than (re)implementation, you could also use an existing library chinese-whispers.
Use G.nodes instead of G.node.
I am currently working on a graph of Twitter users, where I have two CSV files: one is a node list with close to 147,000 nodes and the other is an edge list with all the relationships between the users.
When I import the files into networkx and use the info() method on the graph, it tells me I have upwards of 5,000,000 nodes in the graph (the figure is similar whether I use info() on the directed or the undirected version of the graph).
I have tried this with smaller datasets and the number of nodes matched the number in my node list file. Does anyone know why this may be happening?
many thanks
EDIT
The code I am using can be seen below
import csv
import networkx as nx
import pandas as pd

with open('node list.csv', 'r') as nodecsv:  # Open the file
    nodereader = csv.reader(nodecsv)  # Read the csv
    # Retrieve the data (using Python list comprehension and list slicing
    # to remove the header row, see footnote 3)
    nodes = [n for n in nodereader][1:]

node_names = [n[0] for n in nodes]  # Get a list of only the node names

with open('edge list.csv', 'r') as edgecsv:  # Open the file
    edgereader = csv.reader(edgecsv)  # Read the csv
    edges = [tuple(e) for e in edgereader][1:]  # Retrieve the data

print(len(node_names))
print(len(edges))

G = nx.Graph()
# G.add_nodes_from(node_names)
G.add_edges_from(edges)
print(nx.info(G))
print(total_nodes)

follower_count_dict = {}
friend_count_dict = {}
staus_count_dict = {}
created_at_dict = {}

for node in nodes:  # Loop through the list, one row at a time
    follower_count_dict[node[0]] = node[1]
    friend_count_dict[node[0]] = node[2]
    staus_count_dict[node[0]] = node[3]
    created_at_dict[node[0]] = node[4]

#print( user_followers_count_dict)

nx.set_node_attributes(G, follower_count_dict, 'follower_count')
nx.set_node_attributes(G, friend_count_dict, 'friend_count')
nx.set_node_attributes(G, staus_count_dict, 'staus_count')
nx.set_node_attributes(G, created_at_dict, 'created_at')

DG = nx.DiGraph()
DG.add_nodes_from(node_names)
DG.add_edges_from(edges)
nx.set_node_attributes(DG, follower_count_dict, 'follower_count')
nx.set_node_attributes(DG, friend_count_dict, 'friend_count')
nx.set_node_attributes(DG, staus_count_dict, 'staus_count')
nx.set_node_attributes(DG, created_at_dict, 'created_at')
(Snapshots of the user list and edge list files were attached as images.)
Your edge list includes nodes that do not appear in your node list. So when those edges are added, networkx adds the nodes as well.
Reasons for this could include the nodes being treated as strings with different white space (perhaps '\n' at the end), or nodes being treated as integers in some cases and strings in others.
One way to pin this down: before you add the edges, loop over them, check whether each node is already in the graph, and print any node that is not:
for edge in edges:
    for node in edge:
        if node not in G:
            print(node)
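If the culprit turns out to be stray whitespace, a hedged sketch of a fix is to normalise every field before building the graph (this assumes the first two columns of the edge file are the endpoints; adjust to your actual layout):

# Strip whitespace so node names and edge endpoints compare equal.
node_names = [n[0].strip() for n in nodes]
edges = [tuple(field.strip() for field in e) for e in edges]

G = nx.Graph()
G.add_nodes_from(node_names)
G.add_edges_from(e[:2] for e in edges)   # use only the endpoint columns
print(G.number_of_nodes())               # should now match len(node_names)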
I am using networkx to create an algorithm to calculate the modularity of the different communities. I am getting a KeyError on G[complst[i]][complst[j]]['weight'], even though I printed out complst[i] and complst[j] and found these values to be correct. Can anyone help? I have tried many ways to debug it, such as saving the values in separate variables, but they don't help.
import networkx as nx
import copy

#load the graph made in the previous task
G = nx.read_gexf("graph.gexf")

#set a global max modularity value
maxmod = 0

#deep copy of the original graph, since removing edges will change the graph
ori = copy.deepcopy(G)

#create an array for saving the edges to remove
arr = []

#see if all edges are broken; if not, keep looping, otherwise stop
while(G.number_of_edges() != 0):
    #find the edge betweenness for each edge
    betweeness = nx.edge_betweenness_centrality(G, weight='weight', normalized=False)
    print('------------------******************--------------------')

    #sort the result in descending order and save all edges with the maximum betweenness to 'arr'
    sortbet = {k: v for k, v in sorted(betweeness.items(), key=lambda item: item[1], reverse=True)}

    #convert the dict to a list for processing
    betlst = list(sortbet)
    for i in range(len(betlst)):
        if betlst[i] == betlst[0]:
            arr.append(betlst[i])

    #remove all edges with maximum betweenness from the graph
    G.remove_edges_from(arr)

    #find the leftover components, and convert the result to a list for further modularity processing
    lst = list(nx.connected_components(G))

    #testing/debugging: this value is still printed correctly
    print(G['pk_sullivan']['ChrisWarcraft']['weight'])

    #create a variable cnt to represent the modularity of this split
    cnt = 0

    #iterate over lst; each component is saved as a Python set
    for n in range(len(lst)):
        #convert each component from set to list for processing
        complst = list(lst[n])

        #if this component is a singleton, its modularity is 0, so add 0 to cnt
        if len(complst) == 1:
            cnt += 0
        else:
            #calculate the modularity for this component using combinations of nodes
            for i in range(0, len(complst)):
                if i + 1 <= len(complst) - 1:
                    for j in range(i + 1, len(complst)):
                        #debugging: all values print fine until print(G[a][b]['weight'])
                        print(i)
                        print(j)
                        print(complst)
                        a = complst[i]
                        print(type(a))
                        b = complst[j]
                        print(type(b))
                        print(G[a][b]['weight'])
                        #calculate the modularity using M = 1/2m*(weight(a,b) - degree(a)*degree(b)/2m)
                        cnt += 1/(2*ori.number_of_edges())*(G[a][b]['weight'] - ori.degree(a)*ori.degree(b)/(2*ori.number_of_edges()))

    #keep the maximum modularity and save this split of the graph
    if cnt >= maxmod:
        maxmod = cnt
        newgraph = copy.deepcopy(G)
        print('maxmod is', maxmod)
You are welcome to run the code to reproduce the error; I hope the comments in the code help illustrate the problem!
It looks like you're trying to find the weight of every pair of nodes within each connected component. The problem is that you're assuming all nodes in a connected component are directly connected, i.e. joined by a single edge, which is wrong.
In your code, you have:
...
for i in range(0, len(complst)):
    if i + 1 <= len(complst) - 1:
        for j in range(i + 1, len(complst)):
            ...
And then you try to find the weight of the edge that connects these two nodes. But not every pair of nodes in a connected component is joined by an edge; a connected component just means that all nodes are reachable from all others.
So you should be iterating over the edges in the subgraph generated by the connected component, or something along these lines.
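As an illustration of that idea, here is a hedged sketch (my own, reusing your cnt and ori variables, and not a full modularity fix): it only visits edges that actually exist, so the KeyError disappears, but note that Newman's modularity also has a -degree(a)*degree(b)/(2m)^2 contribution from node pairs that are not joined by an edge, which you would still need to account for separately.

m = ori.number_of_edges()
for component in nx.connected_components(G):
    sub = G.subgraph(component)              # view restricted to this component
    for a, b, data in sub.edges(data=True):  # only real edges, so no KeyError
        cnt += 1 / (2 * m) * (data.get('weight', 1)
                              - ori.degree(a) * ori.degree(b) / (2 * m))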
The idea is to compute the resilience of a network presented as an undirected graph in the form
{node: (set of its neighbors) for each node in the graph}.
The function removes nodes from the graph in random order one by one and calculates the size of the largest remaining connected component.
The helper function bfs_visited() returns the set of nodes that are still connected to the given node.
How can I improve the implementation of the algorithm in Python 2, preferably without changing the breadth-first algorithm in the helper function?
from collections import deque

def bfs_visited(graph, node):
    """undirected graph {Vertex: {neighbors}}
    Returns the set of all nodes visited by the algorithm"""
    queue = deque()
    queue.append(node)
    visited = set([node])
    while queue:
        current_node = queue.popleft()
        for neighbor in graph[current_node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return visited

def cc_visited(graph):
    """undirected graph {Vertex: {neighbors}}
    Returns a list of sets of connected components"""
    remaining_nodes = set(graph.keys())
    connected_components = []
    for node in remaining_nodes:
        visited = bfs_visited(graph, node)
        if visited not in connected_components:
            connected_components.append(visited)
        remaining_nodes = remaining_nodes - visited
        #print(node, remaining_nodes)
    return connected_components

def largest_cc_size(ugraph):
    """returns the size (an integer) of the largest connected component in
    the ugraph."""
    if not ugraph:
        return 0
    res = [(len(ccc), ccc) for ccc in cc_visited(ugraph)]
    res.sort()
    return res[-1][0]

def compute_resilience(ugraph, attack_order):
    """
    input: a graph {V: N}
    returns a list whose k+1th entry is the size of the largest cc after
    the removal of the first k nodes
    """
    res = [len(ugraph)]
    for node in attack_order:
        neighbors = ugraph[node]
        for neighbor in neighbors:
            ugraph[neighbor].remove(node)
        ugraph.pop(node)
        res.append(largest_cc_size(ugraph))
    return res
I received this tremendously great answer from Gareth Rees, which covers the question completely.
Review
The docstring for bfs_visited should explain the node argument.
The docstring for compute_resilience should explain that the ugraph argument gets modified. Alternatively, the function could take a copy of the graph so that the original is not modified.
In bfs_visited the lines:
queue = deque()
queue.append(node)
can be simplified to:
queue = deque([node])
The function largest_cc_size builds a list of pairs:
res = [(len(ccc), ccc) for ccc in cc_visited(ugraph)]
res.sort()
return res[-1][0]
But you can see that it only ever uses the first element of each pair (the size of the component). So you could simplify it by not building the pairs:
res = [len(ccc) for ccc in cc_visited(ugraph)]
res.sort()
return res[-1]
Since only the size of the largest component is needed, there is no need to build the whole list. Instead you could use max to find the largest:
if ugraph:
    return max(map(len, cc_visited(ugraph)))
else:
    return 0
If you are using Python 3.4 or later, this can be further simplified using the default argument to max:
return max(map(len, cc_visited(ugraph)), default=0)
This is now so simple that it probably doesn't need its own function.
This line:
remaining_nodes = set(graph.keys())
can be written more simply:
remaining_nodes = set(graph)
There is a loop over the set remaining_nodes where on each loop iteration you update remaining_nodes:
for node in remaining_nodes:
    visited = bfs_visited(graph, node)
    if visited not in connected_components:
        connected_components.append(visited)
    remaining_nodes = remaining_nodes - visited
It looks as if the intention of the code is to avoid iterating over the nodes in visited by removing them from remaining_nodes, but this doesn't work! The problem is that the for statement:
for node in remaining_nodes:
only evaluates the expression remaining_nodes once, at the start of the loop. So when the code creates a new set and assigns it to remaining_nodes:
remaining_nodes = remaining_nodes - visited
this has no effect on the nodes being iterated over.
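A tiny hedged demonstration of this behaviour (my own snippet, not from the review):

s = {1, 2, 3}
for x in s:
    s = s - {2, 3}   # rebinds the name s; the loop still iterates over the original set object
    print(x)         # 1, 2 and 3 are all printed (in some order)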
You might imagine trying to fix this by using the difference_update method to adjust the set being iterated over:
remaining_nodes.difference_update(visited)
but this would be a bad idea because then you would be iterating over a set and modifying it within the loop, which is not safe. Instead, you need to write the loop as follows:
while remaining_nodes:
    node = remaining_nodes.pop()
    visited = bfs_visited(graph, node)
    if visited not in connected_components:
        connected_components.append(visited)
    remaining_nodes.difference_update(visited)
Using while and pop is the standard idiom in Python for consuming a data structure while modifying it — you do something similar in bfs_visited.
There is now no need for the test:
if visited not in connected_components:
since each component is produced exactly once.
In compute_resilience the first line is:
res = [len(ugraph)]
but this only works if the graph is a single connected component to start with. To handle the general case, the first line should be:
res = [largest_cc_size(ugraph)]
For each node in attack order, compute_resilience calls:
res.append(largest_cc_size(ugraph))
But this doesn't take advantage of the work that was previously done. When we remove node from the graph, all connected components remain the same, except for the connected component containing node. So we can potentially save some work if we only do a breadth-first search over that component, and not over the whole graph. (Whether this actually saves any work depends on how resilient the graph is. For highly resilient graphs it won't make much difference.)
In order to do this we'll need to redesign the data structures so that we can efficiently find the component containing a node, and efficiently remove that component from the collection of components.
This answer is already quite long, so I won't explain in detail how to redesign the data structures, I'll just present the revised code and let you figure it out for yourself.
def connected_components(graph, nodes):
    """Given an undirected graph represented as a mapping from nodes to
    the set of their neighbours, and a set of nodes, find the
    connected components in the graph containing those nodes.

    Returns:
    - mapping from nodes to the canonical node of the connected
      component they belong to
    - mapping from canonical nodes to connected components
    """
    canonical = {}
    components = {}
    while nodes:
        node = nodes.pop()
        component = bfs_visited(graph, node)
        components[node] = component
        nodes.difference_update(component)
        for n in component:
            canonical[n] = node
    return canonical, components

def resilience(graph, attack_order):
    """Given an undirected graph represented as a mapping from nodes to
    an iterable of their neighbours, and an iterable of nodes, generate
    integers such that the k-th result is the size of the largest
    connected component after the removal of the first k-1 nodes.
    """
    # Take a copy of the graph so that we can destructively modify it.
    graph = {node: set(neighbours) for node, neighbours in graph.items()}
    canonical, components = connected_components(graph, set(graph))
    largest = lambda: max(map(len, components.values()), default=0)
    yield largest()
    for node in attack_order:
        # Find connected component containing node.
        component = components.pop(canonical.pop(node))
        # Remove node from graph.
        for neighbor in graph[node]:
            graph[neighbor].remove(node)
        graph.pop(node)
        component.remove(node)
        # Component may have been split by removal of node, so search
        # it for new connected components and update data structures
        # accordingly.
        canon, comp = connected_components(graph, component)
        canonical.update(canon)
        components.update(comp)
        yield largest()
In the revised code, the max operation has to iterate over all the remaining connected components in order to find the largest one. It would be possible to improve the efficiency of this step by storing the connected components in a priority queue so that the largest one can be found in time that's logarithmic in the number of components.
I doubt that this part of the algorithm is a bottleneck in practice, so it's probably not worth the extra code, but if you need to do this, then there are some Priority Queue Implementation Notes in the Python documentation.
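For what it's worth, here is a hedged sketch of that idea using heapq with lazy deletion (my own illustration, not part of the original answer): every time a component is created it is pushed onto a heap keyed by negative size, and stale entries are discarded when the maximum is requested.

import heapq

heap = []                                   # entries are (-size, canonical_node)

def push_component(canon, component):
    heapq.heappush(heap, (-len(component), canon))

def largest_size(components):
    # Pop entries that no longer describe a live component of that size.
    while heap:
        neg_size, canon = heap[0]
        if canon in components and len(components[canon]) == -neg_size:
            return -neg_size
        heapq.heappop(heap)
    return 0

resilience() would then call push_component() for every component returned by connected_components() and use largest_size(components) instead of taking max over all components.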
Performance comparison
Here's a useful function for making test cases:
from itertools import combinations
from random import random
def random_graph(n, p):
"""Return a random undirected graph with n nodes and each edge chosen
independently with probability p.
"""
assert 0 <= p <= 1
graph = {i: set() for i in range(n)}
for i, j in combinations(range(n), 2):
if random() <= p:
graph[i].add(j)
graph[j].add(i)
return graph
Now, a quick performance comparison between the revised and the original code. Note that we have to run the revised code first, because the original code destructively modifies the graph, as noted in the review above.
>>> from timeit import timeit
>>> G = random_graph(300, 0.2)
>>> timeit(lambda:list(resilience(G, list(G))), number=1) # revised
0.28782312001567334
>>> timeit(lambda:compute_resilience(G, list(G)), number=1) # original
59.46968446299434
So the revised code is about 200 times faster on this test case.
I'm working on a project using the library networkx (for graph management) in Python, and I've been having trouble trying to implement what I need.
I have a collection of directed graphs holding special objects as nodes and weights associated with the edges. I need to go through each graph from its output nodes to its input nodes, and for each node I have to take the weights from its predecessors, together with an operation calculated by each predecessor node, to build the operation for my output node. The problem is that the operations of the predecessors may depend on their own predecessors, and so on, so I'm wondering how I can solve this.
So far I have tried the following. Let's say I have a list of my output nodes, and I can go through the predecessors using the methods of the networkx library:
import numpy as np

# graph is the object containing my directed graph
for node in outputNodes:
    activate_predecessors(node, graph)

# ...and a function to activate the predecessors...
def activate_predecessors(node, graph):
    ws = []   # a list for the weights
    res = []  # a list for the responses from the predecessors
    for pred in graph.predecessors(node):
        # get the weights
        ws.append(graph[pred][node]['weight'])
        activate_predecessors(pred, graph)
        # append the response from my predecessor node to a list; this response
        # depends on that node's own predecessors, so I call this function on
        # the current predecessor recursively
        res.append(pred.getResp())
    # after I have the two lists (weights and responses), the node should
    # calculate a reduce operation after turning those lists into numpy arrays
    node.response = np.sum(np.array(ws) * np.array(res))
This code seems to work... I tried it on some random graphs many times, but on many occasions it gives a maximum recursion depth exceeded error, so I need to rewrite it in a more stable (and possibly iterative) way to avoid hitting the recursion limit, but I'm running out of ideas for how to handle this.
The library has some search algorithms (depth-first search), but I don't know how they could help me solve this.
I also tried putting flags on the nodes to know whether they had already been activated, but I keep getting the same error.
Edit: I forgot, the input nodes have a defined response value so they don't need to do calculations.
Your code may contain an infinite recursion if there is a cycle between two nodes. For example:
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([(1, 2), (2, 1)])

def activate_nodes(g, node):
    for pred in g.predecessors(node):
        activate_nodes(g, pred)

activate_nodes(G, 1)

RuntimeError: maximum recursion depth exceeded
If any of your graphs can contain cycles, you should either mark each node as visited (a sketch of that approach follows) or change the edges so that the graph has no cycles.
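A minimal hedged sketch of the visited-marking variant (my own illustration):

def activate_nodes(g, node, visited=None):
    # keep a shared set of already-activated nodes so cycles cannot recurse forever
    if visited is None:
        visited = set()
    if node in visited:
        return
    visited.add(node)
    for pred in g.predecessors(node):
        activate_nodes(g, pred, visited)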
Assuming your graphs have no cycles, here is an example of how to implement the algorithm iteratively:
import networkx as nx

# Note: this example is written against the older networkx 1.x API, where
# predecessors() returns a list and node attributes are accessed via G.node.
G = nx.DiGraph()
G.add_nodes_from([1, 2, 3])
G.add_edges_from([(2, 1), (3, 1), (2, 3)])
G.node[1]['weight'] = 1
G.node[2]['weight'] = 2
G.node[3]['weight'] = 3

def activate_node(g, start_node):
    stack = [start_node]
    ws = []
    while stack:
        node = stack.pop()
        preds = g.predecessors(node)
        stack += preds
        print('%s -> %s' % (node, preds))
        for pred in preds:
            ws.append(g.node[pred]['weight'])
    print('weights: %r' % ws)
    return sum(ws)

print('total sum %d' % activate_node(G, 1))
this code prints:
1 -> [2, 3]
3 -> [2]
2 -> []
2 -> []
weights: [2, 3, 2]
total sum 7
Note
You can reverse the direction of the directed graph using DiGraph.reverse().
If you need to use DFS or something else, reversing the graph turns each node's predecessors into its directly connected successors, which can make algorithms like DFS easier to apply.
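For example, a hedged sketch combining the two notes (my own illustration, using the same small graph as above and assuming it has no cycles):

R = G.reverse()                                   # predecessors in G become successors in R
order = list(nx.dfs_postorder_nodes(R, source=1))
print(order)   # e.g. [2, 3, 1]: every predecessor appears before the node that depends on it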