Use NetworkX to find cycles in MultiDiGraph imported from shapefile - python

I am writing a QGIS plugin which will use the NetworkX library to manipulate and analyze stream networks. My data comes from shapefiles representing stream networks.
(arrows represent direction of stream flow)
Within this stream network are braids which are important features I need to retain. I am categorizing braid features into "simple" (two edges that share two nodes) and "complex" (more than two edges, with more than two nodes).
Simple braid example
Complex braid example
Normally, I would just use the NetworkX built-in function read_shp to import the shapefile as a DiGraph. As the examples show, a "simple" braid consists of parallel edges, and a NetworkX DiGraph cannot store parallel edges: the two edges (which share the same from and to nodes) would be collapsed into a single edge. To preserve these multiple edges, we wrote a function that imports a shapefile as a MultiDiGraph. Simple braids (i.e. parallel edges) are preserved by giving each edge a unique key (the function is a method of a class):
def _shp_to_nx(self, in_network_lyr, simplify=True, geom_attrs=True):
    """
    This is a re-purposed version of read_shp from the NetworkX library.
    :param shapelayer:
    :param simplify:
    :param geom_attrs:
    """
    self.G = nx.MultiDiGraph()
    for f in in_network_lyr.getFeatures():
        flddata = f.attributes()
        fields = [str( for fi in f.fields()]
        geo = f.geometry()
        # We don't care about M or Z
        attributes = dict(zip(fields, flddata))
        # Add a new _FID_ field
        fid = int(
        attributes[self.id_field] = fid
        attributes['_calc_len_'] = geo.length()
        # Note: Using layer level geometry type
        if geo.wkbType() in (QgsWKBTypes.LineString, QgsWKBTypes.MultiLineString):
            for edge in self.edges_from_line(geo, attributes, simplify, geom_attrs):
                e1, e2, attr = edge
                self.features[fid] = attr
                self.G.add_edge(tuple(e1), tuple(e2), key=attr[self.id_field], attr_dict=attr)
            self.cols = self.features[self.features.keys()[0]].keys()
        else:
            raise ImportError("GeometryType {} not supported. For now we only support LineString types.".
I have already written a function to find the "simple" braid features (I just iterate through the MultiDiGraph's nodes and find edges with more than one key). But I also need to find the "complex" braids. Normally, in a Graph, I could use cycle_basis to find all of the "complex" braids (i.e. cycles); however, cycle_basis only works on undirected graphs, not directed ones. I'd rather not convert my MultiDiGraph into an undirected Graph, as that conversion can produce unexpected results (not to mention losing my edge key values).
How could I go about finding cycles which are made up of more than one edge, in a relatively time-efficient way? The stream networks I'm really working with can be quite large and complex, representing large watersheds.

So I came up with a solution, for finding both "simple" and "complex" braids.
def get_complex_braids(self, G, attrb_field, attrb_name):
    """
    Create graph with the braid edges attributed
    :param attrb_field: name of the attribute field
    :return braid_G: graph with new attribute
    """
    if nx.is_directed(G):
        UG = nx.Graph(G)
        braid_G = nx.MultiDiGraph()
        for edge in G.edges(data=True, keys=True):
            is_edge = self.get_edge_in_cycle(edge, UG)
            if is_edge == True:
                braid_G.add_edge(*edge)
        self.update_attribute(braid_G, attrb_field, attrb_name)
        return braid_G
    else:
        print "ERROR: Graph is not directed."
        braid_complex_G = nx.null_graph()
        return braid_complex_G
def get_simple_braids(self, G, attrb_field, attrb_name):
    """
    Create graph with the simple braid edges attributed
    :param attrb_field: name of the attribute field
    :return braid_G: graph with new attribute
    """
    braid_simple_G = nx.MultiDiGraph()
    parallel_edges = []
    for e in G.edges_iter():
        keys = G.get_edge_data(*e).keys()
        if keys not in parallel_edges:
            if len(keys) == 2:
                for k in keys:
                    data = G.get_edge_data(*e, key=k)
                    braid_simple_G.add_edge(e[0], e[1], key=k, attr_dict=data)
    self.update_attribute(braid_simple_G, attrb_field, attrb_name)
    return braid_simple_G
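For readers on a newer NetworkX, here is a minimal sketch of the same idea, assuming NetworkX 2.x or later. The helper name classify_braids is hypothetical and is not the plugin's get_edge_in_cycle. It relies on the fact that an edge lies on an undirected cycle exactly when it is not a bridge, so the undirected copy is used only for the cycle test while the edge keys are read from the original MultiDiGraph.

```python
import networkx as nx

def classify_braids(G):
    """Split the edges of a MultiDiGraph into simple and complex braid edges.

    Returns two sets of (u, v, key) tuples.
    """
    # Simple braids: more than one edge running from u to v.
    simple = {(u, v, k) for u, v, k in G.edges(keys=True)
              if G.number_of_edges(u, v) > 1}
    # Collapse direction and parallel edges only for the cycle test.
    UG = nx.Graph(G)
    bridges = {frozenset(edge) for edge in nx.bridges(UG)}
    # Complex braids: remaining edges that are not bridges of the collapsed graph.
    complex_ = {(u, v, k) for u, v, k in G.edges(keys=True)
                if u != v
                and (u, v, k) not in simple
                and frozenset((u, v)) not in bridges}
    return simple, complex_
```

In this sketch a parallel edge that also sits on a larger loop is reported only as a simple braid; whether that is the right call depends on how the two categories should overlap.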

This is not a definite answer, but longer than maximum allowed characters for a comment, so I post it here anyway.
To find simple braids, you can use built-in methods G.selfloop_edges and G.nodes_with_selfloops.
I haven't heard about cycle_basis for directed graphs, can you provide a reference (e.g. scientific work)? NetworkX has simple_cycles(G) which works on directed Graphs, but it is also not useful in this case, because water does not visit any node twice (or?).
I am afraid that the only way is to precisely describe the topology and then search the graph to find matching occurrences. Let me clarify my point with an example. The following function should be able to identify instances of complex braids similar to your example:
def Complex_braid(G):
    res = []
    # find all nodes with out_degree greater than one:
    candidates = [n for n in G.nodes() if len(G.successors(n)) > 1]
    # find successors:
    for n in candidates:
        succ = G.successors(n)
        for s in succ:
            if len(list(nx.all_simple_paths(G,n,s))) > 1:
                all_nodes = sorted(list(nx.all_simple_paths(G,n,s)), key=len)[-1]
                res.append(all_nodes)
    return res

G = nx.MultiDiGraph()
G.add_edges_from([(0,1), (1,2), (2,3), (4,5), (1,5), (5,2)])
print Complex_braid(G)
# out: [[1, 5, 2]]
But the actual problem is that complex braids can occur in many different topological configurations, so it doesn't really make sense to define all of them, unless you can describe them with one (or a few) patterns or you can find a condition that signifies the presence of a complex braid.


Drawing graph with labels in networkx obtained from a py2neo query

I am running some data analysis with a Jupyter notebook where I have a query with a variable length matching like this one:
MATCH p=(s:Skill)-[:BROADER*0..3]->(s)
WHERE s.label='py2neo' or s.label='Python'
I would like to plot its result as a graph, using networkx.
So far I have found two unsatisfactory solutions. Based on a notebook here, I can generate a graph using cypher magic, whose result is directly understood by the networkx module.
result = %cypher MATCH p=(s:Skill)-[:BROADER*0..3]->(s) WHERE s.label='py2neo' or s.label='Python' RETURN p
However, then I am unable to find a way to add the labels to the plot.
That solution bypasses py2neo. With py2neo I can put labels on a graph, as long as I don't use a variable length pattern.
query='''MATCH p=(s1:Skill)-[:BROADER]->(s2)
WHERE s1.label='py2neo' or s1.label='Python'
RETURN s1.label as child, s2.label as parent'''
df =
And then, copying from a response here in Stackoverflow (which I will link later) I can build the graph manually
G.add_nodes_from(list(set(list(df.iloc[:,0]) + list(df.iloc[:,1]))))
#Add edges
tuples = [tuple(x) for x in df.values]
#Perform Graph Drawing
#A star network (sort of)
With this I get a graph with labels, but to get something like the variable length matching I should use multiple queries.
But how can I get the best of both worlds? I would prefer a py2neo solution. Rephrasing: How can I get py2neo to return a graph (not a table) and then be able to pass such information to networkx, being able to determine which, from the multiple possible labels, are the ones to be shown in the graph?
The question at the end was how can I get a table containing all the edges out of a subgraph that matches a certain query.
The Cypher that does the trick is:
MATCH (source:Skill)-[:BROADER*0..7]->(dest:Skill)
WHERE source.label_en in ['skill1','skill2']
UNWIND myNodes as myNode
MATCH p=(myNode)-[:BROADER]->(neighbor)
WHERE neighbor in myNodes
RETURN myNode.label_en as child ,neighbor.label_en as parent
The first two lines get the nodes belonging to said subgraph. The last five unwind it as pairs of nodes connected by a directed edge.
The 0 in the second MATCH allows for collecting isolated nodes that belong to the original list.
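Once the query returns one (child, parent) row per edge, turning the table into a labelled networkx graph is short. The following is a minimal sketch that assumes the rows have already been fetched into plain tuples; the sample rows and the helper name graph_from_rows are made up for illustration and do not come from the database.

```python
import networkx as nx

# Hypothetical (child, parent) rows, shaped like the query's RETURN clause.
rows = [("py2neo", "Python"), ("Python", "Programming"), ("pandas", "Python")]

def graph_from_rows(rows):
    """Build a directed graph with one edge per (child, parent) row."""
    G = nx.DiGraph()
    G.add_edges_from(rows)  # nodes are created implicitly from the edge endpoints
    return G

G = graph_from_rows(rows)
# The node names are the label strings, so they can be shown directly.
labels = {node: node for node in G.nodes()}
# To draw (needs matplotlib): nx.draw(G, labels=labels, with_labels=True)
```

Because the label string is used as the node identifier, choosing which label to display comes down to choosing which property the query returns.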
As of 2019, with current py2neo packages, one way to make this work is:
query = '''
MATCH (source:Skill)-[:BROADER*0..7]->(dest:Skill)
WHERE source.label_en in ['skill1','skill2']
UNWIND myNodes as myNode
MATCH p=(myNode)-[:BROADER]->(neighbor)
WHERE neighbor in myNodes
RETURN myNode.label_en as child ,neighbor.label_en as parent
df = pd.DataFrame(
G.add_nodes_from(list(set(list(df['child']) + list(df['parent']))))
#Add edges
tuples = [tuple(x) for x in df.values]
#Perform Graph Drawing
#A star network (sort of)

Prohibitively slow execution of function compute_resilience in Python

The idea is to compute resilience of the network presented as an undirected graph in form
{node: (set of its neighbors) for each node in the graph}.
The function removes nodes from the graph in random order one by one and calculates the size of the largest remaining connected component.
The helper function bfs_visited() returns the set of nodes that are still connected to the given node.
How can I improve the implementation of the algorithm in Python 2? Preferably without changing the breadth-first algorithm in the helper function
from collections import deque

def bfs_visited(graph, node):
    """undirected graph {Vertex: {neighbors}}
    Returns the set of all nodes visited by the algorithm"""
    queue = deque()
    queue.append(node)
    visited = set([node])
    while queue:
        current_node = queue.popleft()
        for neighbor in graph[current_node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return visited

def cc_visited(graph):
    """ undirected graph {Vertex: {neighbors}}
    Returns a list of sets of connected components"""
    remaining_nodes = set(graph.keys())
    connected_components = []
    for node in remaining_nodes:
        visited = bfs_visited(graph, node)
        if visited not in connected_components:
            connected_components.append(visited)
        remaining_nodes = remaining_nodes - visited
        #print(node, remaining_nodes)
    return connected_components

def largest_cc_size(ugraph):
    """returns the size (an integer) of the largest connected component in
    the ugraph."""
    if not ugraph:
        return 0
    res = [(len(ccc), ccc) for ccc in cc_visited(ugraph)]
    res.sort()
    return res[-1][0]

def compute_resilience(ugraph, attack_order):
    """
    input: a graph {V: N}
    returns a list whose k+1th entry is the size of the largest cc after
    the removal of the first k nodes
    """
    res = [len(ugraph)]
    for node in attack_order:
        neighbors = ugraph[node]
        for neighbor in neighbors:
            ugraph[neighbor].remove(node)
        ugraph.pop(node)
        res.append(largest_cc_size(ugraph))
    return res
I received this tremendously great answer from Gareth Rees, which covers the question completely.
The docstring for bfs_visited should explain the node argument.
The docstring for compute_resilience should explain that the ugraph argument gets modified. Alternatively, the function could take a copy of the graph so that the original is not modified.
In bfs_visited the lines:
queue = deque()
queue.append(node)
can be simplified to:
queue = deque([node])
The function largest_cc_size builds a list of pairs:
res = [(len(ccc), ccc) for ccc in cc_visited(ugraph)]
res.sort()
return res[-1][0]
But you can see that it only ever uses the first element of each pair (the size of the component). So you could simplify it by not building the pairs:
res = [len(ccc) for ccc in cc_visited(ugraph)]
res.sort()
return res[-1]
Since only the size of the largest component is needed, there is no need to build the whole list. Instead you could use max to find the largest:
if ugraph:
return max(map(len, cc_visited(ugraph)))
return 0
If you are using Python 3.4 or later, this can be further simplified using the default argument to max:
return max(map(len, cc_visited(ugraph)), default=0)
This is now so simple that it probably doesn't need its own function.
This line:
remaining_nodes = set(graph.keys())
can be written more simply:
remaining_nodes = set(graph)
There is a loop over the set remaining_nodes where on each loop iteration you update remaining_nodes:
for node in remaining_nodes:
    visited = bfs_visited(graph, node)
    if visited not in connected_components:
        connected_components.append(visited)
    remaining_nodes = remaining_nodes - visited
It looks as if the intention of the code is to avoid iterating over the nodes in visited by removing them from remaining_nodes, but this doesn't work! The problem is that the for statement:
for node in remaining_nodes:
only evaluates the expression remaining_nodes once, at the start of the loop. So when the code creates a new set and assigns it to remaining_nodes:
remaining_nodes = remaining_nodes - visited
this has no effect on the nodes being iterated over.
You might imagine trying to fix this by using the difference_update method to adjust the set being iterated over:
remaining_nodes.difference_update(visited)
but this would be a bad idea because then you would be iterating over a set and modifying it within the loop, which is not safe. Instead, you need to write the loop as follows:
while remaining_nodes:
    node = remaining_nodes.pop()
    visited = bfs_visited(graph, node)
    if visited not in connected_components:
        connected_components.append(visited)
    remaining_nodes.difference_update(visited)
Using while and pop is the standard idiom in Python for consuming a data structure while modifying it — you do something similar in bfs_visited.
There is now no need for the test:
if visited not in connected_components:
since each component is produced exactly once.
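Putting the points so far together, the component search might look like the sketch below. It keeps the breadth-first helper unchanged in spirit and is only one way of applying the advice above, not the final revised code.

```python
from collections import deque

def bfs_visited(graph, node):
    """Return the set of nodes reachable from node in an undirected graph
    given as a mapping {node: set of neighbours}."""
    queue = deque([node])
    visited = {node}
    while queue:
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return visited

def cc_visited(graph):
    """Return a list of sets, one per connected component."""
    remaining_nodes = set(graph)
    connected_components = []
    while remaining_nodes:
        # pop() consumes the set safely while it is being modified.
        node = remaining_nodes.pop()
        visited = bfs_visited(graph, node)
        connected_components.append(visited)
        # Every node of this component is done, so skip it from now on.
        remaining_nodes.difference_update(visited)
    return connected_components
```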
In compute_resilience the first line is:
res = [len(ugraph)]
but this only works if the graph is a single connected component to start with. To handle the general case, the first line should be:
res = [largest_cc_size(ugraph)]
For each node in attack order, compute_resilience calls:
res.append(largest_cc_size(ugraph))
But this doesn't take advantage of the work that was previously done. When we remove node from the graph, all connected components remain the same, except for the connected component containing node. So we can potentially save some work if we only do a breadth-first search over that component, and not over the whole graph. (Whether this actually saves any work depends on how resilient the graph is. For highly resilient graphs it won't make much difference.)
In order to do this we'll need to redesign the data structures so that we can efficiently find the component containing a node, and efficiently remove that component from the collection of components.
This answer is already quite long, so I won't explain in detail how to redesign the data structures, I'll just present the revised code and let you figure it out for yourself.
def connected_components(graph, nodes):
    """Given an undirected graph represented as a mapping from nodes to
    the set of their neighbours, and a set of nodes, find the
    connected components in the graph containing those nodes.

    Returns:
    - mapping from nodes to the canonical node of the connected
      component they belong to
    - mapping from canonical nodes to connected components

    """
    canonical = {}
    components = {}
    while nodes:
        node = nodes.pop()
        component = bfs_visited(graph, node)
        components[node] = component
        nodes.difference_update(component)
        for n in component:
            canonical[n] = node
    return canonical, components

def resilience(graph, attack_order):
    """Given an undirected graph represented as a mapping from nodes to
    an iterable of their neighbours, and an iterable of nodes, generate
    integers such that the k-th result is the size of the largest
    connected component after the removal of the first k-1 nodes.

    """
    # Take a copy of the graph so that we can destructively modify it.
    graph = {node: set(neighbours) for node, neighbours in graph.items()}

    canonical, components = connected_components(graph, set(graph))
    largest = lambda: max(map(len, components.values()), default=0)
    yield largest()
    for node in attack_order:
        # Find connected component containing node.
        component = components.pop(canonical.pop(node))

        # Remove node from graph.
        for neighbor in graph[node]:
            graph[neighbor].remove(node)
        graph.pop(node)
        component.remove(node)

        # Component may have been split by removal of node, so search
        # it for new connected components and update data structures
        # accordingly.
        canon, comp = connected_components(graph, component)
        canonical.update(canon)
        components.update(comp)
        yield largest()
In the revised code, the max operation has to iterate over all the remaining connected components in order to find the largest one. It would be possible to improve the efficiency of this step by storing the connected components in a priority queue so that the largest one can be found in time that's logarithmic in the number of components.
I doubt that this part of the algorithm is a bottleneck in practice, so it's probably not worth the extra code, but if you need to do this, then there are some Priority Queue Implementation Notes in the Python documentation.
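For completeness, here is one hedged sketch of the priority-queue idea using heapq with lazy deletion. The class name LargestTracker and its methods are hypothetical and not part of the revised code above; it assumes component keys are hashable and mutually comparable (for example integer node ids), because ties in size are broken by comparing keys.

```python
import heapq

class LargestTracker:
    """Track component sizes and report the largest one."""

    def __init__(self):
        self._heap = []   # entries are (-size, key); may contain stale entries
        self._size = {}   # live components only: key -> size

    def add(self, key, size):
        self._size[key] = size
        heapq.heappush(self._heap, (-size, key))

    def remove(self, key):
        # Forget the component; its heap entry is discarded lazily later.
        del self._size[key]

    def largest(self):
        # Pop entries whose component was removed or has changed size.
        while self._heap:
            neg_size, key = self._heap[0]
            if self._size.get(key) == -neg_size:
                return -neg_size
            heapq.heappop(self._heap)
        return 0
```

In resilience, add would be called for each new component and remove for the component that is popped; whether this pays off depends on how many components the graph breaks into.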
Performance comparison
Here's a useful function for making test cases:
from itertools import combinations
from random import random
def random_graph(n, p):
    """Return a random undirected graph with n nodes and each edge chosen
    independently with probability p.

    """
    assert 0 <= p <= 1
    graph = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if random() <= p:
            graph[i].add(j)
            graph[j].add(i)
    return graph
Now, a quick performance comparison between the revised and original code. Note that we have to run the revised code first, because the original code destructively modifies the graph, as noted in §1.2 above.
>>> from timeit import timeit
>>> G = random_graph(300, 0.2)
>>> timeit(lambda:list(resilience(G, list(G))), number=1) # revised
>>> timeit(lambda:compute_resilience(G, list(G)), number=1) # original
So the revised code is about 200 times faster on this test case.

Proper subgraphing of a PySpark GraphFrame

graphframes is a network analysis tool based on PySpark DataFrames. The following code is a modified version of the tutorial subgraphing example:
from graphframes.examples import Graphs
import graphframes
g = Graphs(sqlContext).friends() # Get example graph
# Select subgraph of users older than 30
v2 = g.vertices.filter("age > 30")
g2 = graphframes.GraphFrame(v2, g.edges)
One would expect that the new graph, g2 will contain fewer nodes and fewer edges, compared to the original one, g.
However, this is not the case:
print(g.vertices.count(), g.edges.count())
print(g2.vertices.count(), g2.edges.count())
Gives the output:
(6, 7)
(7, 4)
It is obvious that the resulting graph contains edges for non-existing nodes.
Even more disturbing is the fact that g.degrees and g2.degrees are identical. This means that at least some of the graph functionality ignores the node information. Is there a good way to make sure that GraphFrame creates a graph using only the intersection of the supplied nodes and edges arguments?
A method that I use to subgraph a graphframe is using motifs:
motifs = g.find("(a)-[e]->(b)").filter(<conditions for a,b or e>)
new_vertices = sqlContext.createDataFrame( row: row.a).union( row: row.b)).distinct())
new_edges = sqlContext.createDataFrame( row:row.e).distinct())
new_graph = GraphFrame(new_vertices,new_edges)
While this looks more complicated and possibly takes longer in terms of runtime, for more complicated graph queries, this serves well as you interact with the graphframe as a single entity rather than as vertices and edges being separate. So, filtering on vertices also influences edges left in the graphframe.
Interesting.. I'm not able to see that result:
>>> from graphframes.examples import Graphs
>>> import graphframes
>>> g = Graphs(sqlContext).friends() # Get example graph
>>> # Select subgraph of users older than 30
... v2 = g.vertices.filter("age > 30")
>>> g2 = graphframes.GraphFrame(v2, g.edges)
>>> print(g.vertices.count(), g.edges.count())
(6, 7)
>>> print(g2.vertices.count(), g2.edges.count())
(4, 7)
GraphFrames currently does not check at construction time whether the graph is valid, i.e. whether all the edges connect to existing vertices and so on. But it seems the number of vertices is correct after the filter?
My work-arounds may not be the perfect ones, but they work for me.
Problem statement as I got it: having a filtered collection of nodes filtered_nodes, we only want to have the edges from the original graph that include nodes from filtered_nodes.
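To make the intended semantics concrete, here is a tiny plain-Python illustration of that problem statement. It is not GraphFrames or Spark code; induced_edges is a hypothetical helper and the node ids are made up. An edge survives only if both of its endpoints are in the filtered node set.

```python
def induced_edges(edges, kept_nodes):
    """Keep only the (src, dst) edges whose endpoints are both kept."""
    kept = set(kept_nodes)
    return [(src, dst) for src, dst in edges if src in kept and dst in kept]

# Made-up example: node "c" is filtered out, so every edge touching it goes too.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("b", "d")]
print(induced_edges(edges, ["a", "b", "d"]))
```

Both methods below express this same condition with DataFrame operations.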
Method 1: Using joins (costly)
edgesframe = graphframe.edges
src_join = edgesframe.join(filtered_nodes, (edgesframe.src ==, "inner").withColumnRenamed("src", "srcto")
dst_join = edgesframe.join(filtered_nodes, (edgesframe.dst ==, "inner").withColumnRenamed("dst", "dstto")
final_join = src_join.join(dst_join, (src_join.src == dst_join.src) & (src_join.dst == dst_join.dst), "inner").select("src", "dst")
g2 = GraphFrame(filtered_nodes, final_join)
Method 2: Using collected collection as a list-reference for isin-method (I'd only use it on small collections of filter nodes)
edgesframe = graphframe.edges
collected_nodes ="columnWeUseForReference") r: r[0]).collect()
edgs = edgesframe.filter(edgesframe.src.isin(collected_nodes) & edgesframe.dst.isin(collected_nodes))
Does someone have a better approach? I'd be really happy to see it.
I recommend using dropIsolatedVertices().

Finding Successors of Successors in a Directed Graph in NetworkX

I'm working on some code for a directed graph in NetworkX, and have hit a block that's likely the result of my questionable programming experience. What I'm trying to do is the following:
I have a directed graph G, with two "parent nodes" at the top, from which all other nodes flow. When graphing this network, I'd like to graph every node that is a descendant of "Parent 1" one color, and all the other nodes another color. Which means I need a list of Parent 1's successors.
Right now, I can get the first layer of them easily using:
descend= G.successors(parent1)
The problem is this only gives me the first generation of successors. Preferably, I want the successors of successors, the successors of the successors of the successors, etc. Arbitrarily, because it would be extremely useful to be able to run the analysis and make the graph without having to know exactly how many generations are in it.
Any idea how to approach this?
You don't need a list of descendants; you just want to color them. For that, pick an algorithm that traverses the graph and use it to color the edges.
For example, you can do
from networkx.algorithms.traversal.depth_first_search import dfs_edges
G = DiGraph( ... )
for edge in dfs_edges(G, parent1):
If you want to get all the successor nodes, without passing through edges, another way could be:
import networkx as nx
G = nx.DiGraph( ... )
successors = nx.nodes(nx.dfs_tree(G, your_node))
I noticed that if you call instead:
successors = list(nx.dfs_successors(G, your_node))
the nodes of the bottom level are not included, because dfs_successors returns a dict keyed only by nodes that have successors, and list() keeps just those keys.
Well, the successor of successor is just the successor of the descendants right?
# First successors
descend = G.successors(parent1)
# 2nd level successors
def allDescendants(d1):
    d2 = []
    for d in d1:
        d2 += G.successors(d)
    return d2
descend2 = allDescendants(descend)
To get level 3 descendants, call allDescendants(descend2) etc.
Issue 1:
allDescend = descend + descend2 gives you the two sets combined; do the same for further levels of descendants.
Issue 2: If you have loops in your graph, then you need to first modify the code to test if you've visited that descendant before, e.g.:
def allDescendants(d1, exclude):
    d2 = []
    for d in d1:
        d2 += filter(lambda s: s not in exclude, G.successors(d))
    return d2
This way, you pass allDescend as the second argument to the above function so it's not included in future descendants. You keep doing this until allDescendants() returns an empty array, in which case you know you've explored the entire graph, and you stop.
Since this is starting to look like homework, I'll let you figure out how to piece all this together on your own. ;)
So that the answer is somewhat cleaner and easier to find for future folks who stumble upon it, here's the code I ended up using:
G = DiGraph() # Creates an empty directed graph G
infile = open(sys.argv[1])
for edge in infile:
    edge1, edge2 = edge.split() #Splits data on the space
    node1 = int(edge1) #Creates integer version of the node names
    node2 = int(edge2)
    G.add_edge(node1,node2) #Adds an edge between two nodes
data_successors = dfs_successors(G,parent1)
successor_list = data_successors.values()
allsuccessors = [item for sublist in successor_list for item in sublist]
pos = graphviz_layout(G,prog='dot')
draw_networkx_nodes(G,pos,nodelist=allsuccessors, node_color="SkyBlue")
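On a recent NetworkX the same result can be had more directly. This is a minimal sketch, assuming NetworkX 2.x or later; the node names are made up. nx.descendants returns every node reachable from the given node (excluding the node itself), however many generations deep.

```python
import networkx as nx

# Made-up graph: two parents, with "c" reachable from both.
G = nx.DiGraph()
G.add_edges_from([("p1", "a"), ("a", "b"), ("b", "c"), ("p2", "d"), ("d", "c")])

below_p1 = nx.descendants(G, "p1")  # set of all nodes reachable from "p1"

# One color per node, in the same order as G.nodes(), ready to pass as node_color.
node_colors = ["skyblue" if n in below_p1 else "lightgray" for n in G.nodes()]
```

The list can then be handed to a drawing call such as nx.draw(G, node_color=node_colors), which needs matplotlib installed.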
I believe NetworkX has changed since @Jochen Ritzel's answer a few years ago.
Now the following holds, only changing the import statement.
import networkx
from networkx import dfs_edges
G = DiGraph( ... )
for edge in dfs_edges(G, parent1):
descendents = sum(nx.dfs_successors(G, parent).values(), [])
nx.descendants(G, parent)
more details:

How should I create my object so it works with networkx pretty well?

I'm trying to design a project that takes global positioning data, like city and state names along with latitudes and locations. I'll also have distances between every pair of cities. I want to make a graph with all of this information, and manipulate it to perform some graph algorithms. I've decided to have city objects which contains each location's data. Now should I have a hash function to differentiate objects? And how should I handle graph algorithms that combine nodes and remove edges?
def minCut(self):
    """Returns the lowest-cost set of edges that will disconnect a graph"""
    smcut = (float('infinity'), None)
    cities = self.__selectedcities[:]
    edges = self.__selectededges[:]
    g = self.__makeGRAPH(cities, edges)
    if not nx.is_connected(g):
        print("The graph is already disconnected!")
    while len(g.nodes()) >1:
        stphasecut = self.mincutphase(g)
        if stphasecut[2] < smcut:
            smcut = (stphasecut[2], None)
        self.__merge(g, stphasecut[0], stphasecut[1])
    print("Weight of the min-cut: "+str(smcut[1]))
It's in really bad shape. I'm rewriting my original program, but this is the approach I took from the previous version.
Depending on what version of networkx you have installed, there is a built-in implementation of min_cut available.
I had the 1.0RC1 package installed and that was not available.. but I upgraded to 1.4 and min_cut is there.
Here's a (silly) example:
import networkx as nx
g = nx.DiGraph()
g.add_nodes_from(['London', 'Boston', 'NY', 'Dallas'])
g.add_edge('NY', 'Boston', capacity)
g.add_edge('Dallas', 'Boston')
g.add_edge('Dallas', 'London')
# add capacity to existing edge
g.edge['Dallas']['London']['capacity'] = 2
# create edge with capacity attribute
g.add_edge('NY', 'London', capacity=3)
print nx.min_cut(g, 'NY', 'London')
You don't need to create a hash function for the city objects, you can pass the city object directly to Networkx - from the tutorial "nodes can be any hashable object e.g. a text string, an image, an XML object, another Graph, a customized node object, etc."
You can iterate over the list of cities and add them as nodes and then iterate the distance information to make a graph.
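As a rough illustration of that advice, the sketch below uses a frozen dataclass as the node type and lets NetworkX do the merging. Everything here is an assumption for illustration: the City fields, the placeholder names and the distances are made up, and nx.stoer_wagner (available in NetworkX 1.9 and later) is swapped in for the hand-written min-cut phases in the question.

```python
from dataclasses import dataclass
import networkx as nx

# frozen=True makes instances immutable and hashable, so they can be nodes.
# order=True is only a precaution in case an algorithm needs to compare nodes.
@dataclass(frozen=True, order=True)
class City:
    name: str
    state: str
    lat: float
    lon: float

a = City("A", "X", 0.0, 0.0)
b = City("B", "X", 0.0, 1.0)
c = City("C", "Y", 1.0, 1.0)
d = City("D", "Y", 1.0, 0.0)

G = nx.Graph()
# Hypothetical edge weights between pairs of cities.
G.add_edge(a, b, weight=5)
G.add_edge(c, d, weight=5)
G.add_edge(b, c, weight=1)
G.add_edge(a, d, weight=1)

# Global minimum cut of an undirected weighted graph.
cut_value, partition = nx.stoer_wagner(G)
```

Because the library contracts nodes internally, the original City objects stay untouched and no custom merged-node bookkeeping is needed.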
Have you looked at the tutorial?
As you merge nodes, you can create new nodes and hash these new nodes to list of cities that have been merged. For example, in the above code you can name the new node len(g) and hash it to stphasecut[0]+ stphasecut[1] # assuming stphasecut[1] and[2] are lists.
