I am running some data analysis with a Jupyter notebook where I have a query with a variable length matching like this one:
MATCH p=(s:Skill)-[:BROADER*0..3]->(s)
WHERE s.label='py2neo' or s.label='Python'
RETURN p
I would like to plot its result as a graph, using networkx.
So far I have found two unsatisfactory solutions. Based on a notebook here, I can generate a graph using cypher magic whose result is directly understood by the networkx module.
result = %cypher MATCH p=(s:Skill)-[:BROADER*0..3]->(s) WHERE s.label='py2neo' or s.label='Python' RETURN p
nx.draw(result.get_graph())
However, then I am unable to find a way to add the labels to the plot.
That solution bypasses py2neo. With py2neo I can put labels on a graph, as long as I don't use a variable length pattern.
Example:
query='''MATCH p=(s1:Skill)-[:BROADER]->(s2)
WHERE s1.label='py2neo' or s1.label='Python'
RETURN s1.label as child, s2.label as parent'''
df = sgraph.data(query)
And then, copying from a response here in Stackoverflow (which I will link later) I can build the graph manually
G=nx.DiGraph()
G.add_nodes_from(list(set(list(df.iloc[:,0]) + list(df.iloc[:,1]))))
#Add edges
tuples = [tuple(x) for x in df.values]
G.add_edges_from(tuples)
G.number_of_edges()
#Perform Graph Drawing
#A star network (sort of)
nx.draw_networkx(G)
plt.show()
With this I get a graph with labels, but to get something like the variable length matching I should use multiple queries.
But how can I get the best of both worlds? I would prefer a py2neo solution. Rephrasing: How can I get py2neo to return a graph (not a table) and then be able to pass such information to networkx, being able to determine which, from the multiple possible labels, are the ones to be shown in the graph?
The question at the end was how can I get a table containing all the edges out of a subgraph that matches a certain query.
The Cypher that does the trick is:
MATCH (source:Skill)-[:BROADER*0..7]->(dest:Skill)
WHERE source.label_en in ['skill1','skill2']
WITH COLLECT(DISTINCT source)+COLLECT(dest) AS myNodes
UNWIND myNodes as myNode
MATCH p=(myNode)-[:BROADER]->(neighbor)
WHERE neighbor in myNodes
RETURN myNode.label_en as child ,neighbor.label_en as parent
The first two lines get the nodes belonging to said subgraph. The last five unwind it as pairs of nodes connected by a directed edge.
The 0 in the first MATCH (the *0..7 range) allows for collecting isolated nodes that belong to the original list.
As of 2019, with current py2neo packages, one way to make this work is:
query = '''
MATCH (source:Skill)-[:BROADER*0..7]->(dest:Skill)
WHERE source.label_en in ['skill1','skill2']
WITH COLLECT(DISTINCT source)+COLLECT(dest) AS myNodes
UNWIND myNodes as myNode
MATCH p=(myNode)-[:BROADER]->(neighbor)
WHERE neighbor in myNodes
RETURN myNode.label_en as child ,neighbor.label_en as parent
'''
df = pd.DataFrame(graph.run(query).data())
G=nx.DiGraph()
G.add_nodes_from(list(set(list(df['child']) + list(df['parent']))))
#Add edges
tuples = [tuple(x) for x in df.values]
G.add_edges_from(tuples)
G.number_of_edges()
#Perform Graph Drawing
#A star network (sort of)
nx.draw_networkx(G)
plt.show()
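As a minimal sketch of the labelling step: assuming the query is extended to return a stable identifier and a display label for both endpoints of each edge (the column names child_id, child, parent_id and parent below are hypothetical and not part of the query above), the node keys and the labels shown in the plot can be kept separate and handed to nx.draw_networkx through its labels argument.

```python
import networkx as nx

# Hypothetical rows, shaped like the list of dicts returned by graph.run(query).data()
records = [
    {"child_id": 1, "child": "py2neo", "parent_id": 2, "parent": "Python"},
    {"child_id": 2, "child": "Python", "parent_id": 3, "parent": "Programming"},
]

G = nx.DiGraph()
labels = {}  # node key -> text to display
for row in records:
    G.add_edge(row["child_id"], row["parent_id"])
    labels[row["child_id"]] = row["child"]
    labels[row["parent_id"]] = row["parent"]

# To draw: nx.draw_networkx(G, labels=labels) followed by plt.show()
```

Choosing a different property for the plot then only means filling the labels dictionary from another column; the graph itself stays unchanged.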
I am learning to use networkx and I am attempting to apply it to a large graph. I have an edge list downloaded from https://biomine.ijs.si/downloads/. The data (biomine_3_oct_2018.bmg) is plain text and zipped (biomine_3_oct_2018.zip); it has 1520673 nodes and 32761889 edges. Here is a brief overview of the head of the data:
# _symmetric overlaps
# _symmetric is_related_to
# _symmetric has_synonym
# _symmetric is_homologous_to
# _symmetric functionally_associated_to
# _symmetric interacts_with
BiologicalProcess_GO:GO:0000001 BiologicalProcess_GO:GO:0048308 is_a 0.414
BiologicalProcess_GO:GO:0000001 BiologicalProcess_GO:GO:0048311 is_a 0.566
BiologicalProcess_GO:GO:0000002 BiologicalProcess_GO:GO:0007005 is_a 0.439
BiologicalProcess_GO:GO:0000003 BiologicalProcess_GO:GO:0008150 is_a 0.345
...
Column meanings are
node1 node2 link_type link_goodness
I created a pandas dataframe from this data and I ignored the first 6 lines (start with #). The length of the pandas dataframe is correct (32761889) but when I use networkx to generate a directed graph:
Biomine_data = pd.read_csv('biomine_3_oct_2018.bmg', sep=" ", header=None)
Biomine_data.columns = ["from","to","relation","weightProb"]
G = net.from_pandas_edgelist(Biomine_data, 'from', 'to', ["relation","weightProb"], create_using=net.DiGraph())
I got 32753853, which is 8036 less than the correct number (32761889). Also, the number of nodes is 12086 smaller too, I got 1508587 nodes but it should be 1520673.
len(G.edges)
# return 32753853
len(G.nodes)
# return 1508587
I thought about the possibility that the graph might have some symmetric edges (e.g. A->B & B->A), but even so, the number of nodes shouldn’t be less right?
I also tried to create an undirected graph and got a different wrong result for the number of edges; the number of nodes is the same as for the directed graph, but still wrong:
G_prime = net.from_pandas_edgelist(Biomine_data, 'from', 'to', ["relation","weightProb"], create_using=net.Graph())
len(G_prime.edges)
# return 32744545
len(G_prime.nodes)
# return 1508587
Does anyone know what could be the reason for this error? Or is there something about the way networkx works that I am not aware of?
P.S. I read something like Nodes in nbunch that are not in the graph will be (quietly) ignored. For directed graphs this returns the out-edges. from https://networkx.github.io/documentation/stable/reference/classes/generated/networkx.Graph.edges.html, could this be the reason? If so, how do I fix that?
Thank you!
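One way to check the "symmetric edges" idea: a DiGraph keeps at most one edge per ordered (from, to) pair and a Graph at most one per unordered pair, so rows that repeat a pair (for example with a different relation) overwrite each other. The sketch below counts what each graph type would keep from a list of rows; the function name and the sample rows are made up for illustration.

```python
def summarize_edge_list(rows):
    """Count what DiGraph and Graph would keep from (source, target, ...) rows."""
    directed = set()    # one entry per ordered pair, like DiGraph
    undirected = set()  # one entry per unordered pair, like Graph
    nodes = set()
    n_rows = 0
    for row in rows:
        source, target = row[0], row[1]
        n_rows += 1
        directed.add((source, target))
        undirected.add(frozenset((source, target)))
        nodes.update((source, target))
    return {
        "rows": n_rows,
        "directed_edges": len(directed),
        "undirected_edges": len(undirected),
        "nodes": len(nodes),
    }


# Made-up sample: A->B appears twice with different relations, and B->A once.
sample = [
    ("A", "B", "is_a", 0.4),
    ("A", "B", "overlaps", 0.5),
    ("B", "A", "is_related_to", 0.3),
    ("B", "C", "is_a", 0.6),
]
summary = summarize_edge_list(sample)
```

Running the same count on the DataFrame rows would show whether repeated pairs account for the missing edges. It does not by itself explain a lower node count; comparing the unique names in the raw file with those in the DataFrame may show whether something is lost at the parsing stage instead.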
I am writing a QGIS plugin which will use the NetworkX library to manipulate and analyze stream networks. My data comes from shapefiles representing stream networks.
(arrows represent direction of stream flow)
Within this stream network are braids which are important features I need to retain. I am categorizing braid features into "simple" (two edges that share two nodes) and "complex" (more than two edges, with more than two nodes).
Simple braid example
Complex braid example
Normally, I would just use the NetworkX built-in function read_shp to import the shapefile as a DiGraph. As is evident in the examples, a "simple" braid consists of parallel edges, which a NetworkX DiGraph cannot represent: the two edges (which share the same to and from nodes) would be collapsed into a single edge. In order to preserve these multiple edges, we wrote a function that imports a shapefile as a MultiDiGraph. Simple braids (i.e. parallel edges) are preserved by using unique keys in the edge objects (this is embedded in a class):
def _shp_to_nx(self, in_network_lyr, simplify=True, geom_attrs=True):
    """
    This is a re-purposed version of read_shp from the NetworkX library.
    :param in_network_lyr:
    :param simplify:
    :param geom_attrs:
    :return:
    """
    self.G = nx.MultiDiGraph()
    for f in in_network_lyr.getFeatures():
        flddata = f.attributes()
        fields = [str(fi.name()) for fi in f.fields()]
        geo = f.geometry()
        # We don't care about M or Z
        geo.geometry().dropMValue()
        geo.geometry().dropZValue()
        attributes = dict(zip(fields, flddata))
        # Add a new _FID_ field
        fid = int(f.id())
        attributes[self.id_field] = fid
        attributes['_calc_len_'] = geo.length()
        # Note: Using layer level geometry type
        if geo.wkbType() in (QgsWKBTypes.LineString, QgsWKBTypes.MultiLineString):
            for edge in self.edges_from_line(geo, attributes, simplify, geom_attrs):
                e1, e2, attr = edge
                self.features[fid] = attr
                self.G.add_edge(tuple(e1), tuple(e2), key=attr[self.id_field], attr_dict=attr)
            self.cols = self.features[self.features.keys()[0]].keys()
        else:
            raise ImportError("GeometryType {} not supported. For now we only support LineString types.".
                              format(QgsWKBTypes.displayString(int(geo.wkbType()))))
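The collapsing of parallel edges described above can be seen in a small sketch. The feature ids and node names are made up; the point is only that a DiGraph keeps one edge per (from, to) pair, while a MultiDiGraph keeps one per (from, to, key).

```python
import networkx as nx

# Two features connecting the same pair of nodes (a "simple" braid).
# (fid, from_node, to_node) values are made up for illustration.
features = [(101, "A", "B"), (102, "A", "B")]

dg = nx.DiGraph()
mdg = nx.MultiDiGraph()
for fid, u, v in features:
    dg.add_edge(u, v, fid=fid)            # second call overwrites the first edge
    mdg.add_edge(u, v, key=fid, fid=fid)  # unique key keeps both edges
```

Note that this sketch passes attributes as keyword arguments. In NetworkX 2.0 and later, add_edge no longer accepts an attr_dict argument, so the attr_dict=attr call in the function above would need to become **attr there.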
I have already written a function to find the "simple" braid features (I just iterate through the MultiDiGraph's nodes and find edges with more than one key). But I also need to find the "complex" braids. Normally, in a Graph, I could use cycle_basis to find all of the "complex" braids (i.e. cycles); however, the cycle_basis method only works on undirected graphs, not directed ones. I'd rather not convert my MultiDiGraph into an undirected Graph, as there can be unexpected results associated with that conversion (not to mention losing my edge key values).
How could I go about finding cycles which are made up of more than one edge, in a relatively time-efficient way? The stream networks I'm really working with can be quite large and complex, representing large watersheds.
Thanks!
So I came up with a solution, for finding both "simple" and "complex" braids.
def get_complex_braids(self, G, attrb_field, attrb_name):
    """
    Create graph with the braid edges attributed
    :param attrb_field: name of the attribute field
    :return braid_G: graph with new attribute
    """
    if nx.is_directed(G):
        UG = nx.Graph(G)
        braid_G = nx.MultiDiGraph()
        for edge in G.edges(data=True, keys=True):
            is_edge = self.get_edge_in_cycle(edge, UG)
            if is_edge == True:
                braid_G.add_edge(*edge)
        self.update_attribute(braid_G, attrb_field, attrb_name)
        return braid_G
    else:
        print "ERROR: Graph is not directed."
        braid_complex_G = nx.null_graph()
        return braid_complex_G

def get_simple_braids(self, G, attrb_field, attrb_name):
    """
    Create graph with the simple braid edges attributed
    :param attrb_field: name of the attribute field
    :return braid_G: graph with new attribute
    """
    braid_simple_G = nx.MultiDiGraph()
    parallel_edges = []
    for e in G.edges_iter():
        keys = G.get_edge_data(*e).keys()
        if keys not in parallel_edges:
            if len(keys) == 2:
                for k in keys:
                    data = G.get_edge_data(*e, key=k)
                    braid_simple_G.add_edge(e[0], e[1], key=k, attr_dict=data)
            parallel_edges.append(keys)
    self.update_attribute(braid_simple_G, attrb_field, attrb_name)
    return braid_simple_G
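The helper get_edge_in_cycle is not shown above. One possible reading, sketched here as an assumption rather than as the author's implementation, is that an edge belongs to a braid when it lies on a cycle of the undirected view, which is the same as saying it is not a bridge there. The function name braid_edges is hypothetical; nx.bridges is available in recent NetworkX releases.

```python
import networkx as nx


def braid_edges(G):
    """Return (u, v, key) for each edge of MultiDiGraph G that lies on an undirected cycle."""
    UG = nx.Graph(G)  # drops direction and collapses parallel edges
    bridges = {frozenset(e) for e in nx.bridges(UG)}
    result = []
    for u, v, k in G.edges(keys=True):
        # Parallel edges form a two-edge cycle even if the collapsed edge is a bridge.
        parallel = G.number_of_edges(u, v) + G.number_of_edges(v, u) > 1
        if parallel or frozenset((u, v)) not in bridges:
            result.append((u, v, k))
    return result


# Made-up network: a complex braid 1 -> 2 / 1 -> 5 -> 2 and a simple braid 3 => 4.
G = nx.MultiDiGraph()
G.add_edges_from([(0, 1), (1, 2), (1, 5), (5, 2), (2, 3), (3, 4), (3, 4)])
found = sorted(braid_edges(G))
```

The edge keys are read from the original MultiDiGraph, so they survive even though the undirected copy is only used for the bridge test. Self-loops are not handled specially in this sketch.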
This is not a definite answer, but longer than maximum allowed characters for a comment, so I post it here anyway.
To find simple braids, you can use built-in methods G.selfloop_edges and G.nodes_with_selfloops.
I haven't heard about cycle_basis for directed graphs, can you provide a reference (e.g. scientific work)? NetworkX has simple_cycles(G) which works on directed Graphs, but it is also not useful in this case, because water does not visit any node twice (or?).
I am afraid that the only way is to precisely describe the topology and then search the graph to find matching occurrences. Let me clarify my point with an example. The following function should be able to identify instances of complex braids similar to your example:
def Complex_braid(G):
    res = []
    # find all nodes with out_degree greater than one
    # (list() is needed where successors() returns an iterator):
    candidates = [n for n in G.nodes() if len(list(G.successors(n))) > 1]
    # find successors:
    for n in candidates:
        succ = G.successors(n)
        for s in succ:
            if len(list(nx.all_simple_paths(G,n,s))) > 1:
                all_nodes = sorted(list(nx.all_simple_paths(G,n,s)), key=len)[-1]
                res.append(all_nodes)
    return res
G = nx.MultiDiGraph()
G.add_edges_from([(0,1), (1,2), (2,3), (4,5), (1,5), (5,2)])
Complex_braid(G)
# out: [[1, 5, 2]]
The problem, however, is that complex braids can occur in many different topological configurations, so it doesn't really make sense to enumerate them all, unless you can describe them with one (or a few) patterns or can find a condition that signifies the presence of a complex braid.
graphframes is a network analysis tool based on PySpark DataFrames. The following code is a modified version of the tutorial subgraphing example:
from graphframes.examples import Graphs
import graphframes
g = Graphs(sqlContext).friends() # Get example graph
# Select subgraph of users older than 30
v2 = g.vertices.filter("age > 30")
g2 = graphframes.GraphFrame(v2, g.edges)
One would expect that the new graph, g2 will contain fewer nodes and fewer edges, compared to the original one, g.
However, this is not the case:
print(g.vertices.count(), g.edges.count())
print(g2.vertices.count(), g2.edges.count())
Gives the output:
(6, 7)
(7, 4)
It is obvious that the resulting graph contains edges for non-existing nodes.
Even more disturbing is the fact that g.degrees and g2.degrees are identical. This means that at least some of the graph functionality ignores the node information. Is there a good way to make sure that GraphFrame creates a graph using only the intersection of the supplied nodes and edges arguments?
A method that I use to subgraph a graphframe is using motifs:
motifs = g.find("(a)-[e]->(b)").filter(<conditions for a,b or e>)
new_vertices = sqlContext.createDataFrame(motifs.map(lambda row: row.a).union(motifs.map(lambda row: row.b)).distinct())
new_edges = sqlContext.createDataFrame(motifs.map(lambda row:row.e).distinct())
new_graph = GraphFrame(new_vertices,new_edges)
While this looks more complicated and possibly takes longer in terms of runtime, for more complicated graph queries, this serves well as you interact with the graphframe as a single entity rather than as vertices and edges being separate. So, filtering on vertices also influences edges left in the graphframe.
Interesting.. I'm not able to see that result:
>>> from graphframes.examples import Graphs
>>> import graphframes
>>> g = Graphs(sqlContext).friends() # Get example graph
>>> # Select subgraph of users older than 30
... v2 = g.vertices.filter("age > 30")
>>> g2 = graphframes.GraphFrame(v2, g.edges)
>>> print(g.vertices.count(), g.edges.count())
(6, 7)
>>> print(g2.vertices.count(), g2.edges.count())
(4, 7)
GraphFrames as of now does not check whether the graph is valid (i.e. that all the edges are connected to vertices, and so on) at graph construction time. But it seems like the number of vertices is correct after the filter?
My work-arounds may not be the perfect ones, but they work for me.
Problem statement as I got it: having a filtered collection of nodes filtered_nodes, we only want to have the edges from the original graph that include nodes from filtered_nodes.
Method 1: Using joins (costly)
edgesframe = graphframe.edges
src_join = edgesframe.join(filtered_nodes, (edgesframe.src == filtered_nodes.id), "inner").withColumnRenamed("src", "srcto")
dst_join = edgesframe.join(filtered_nodes, (edgesframe.dst == filtered_nodes.id), "inner").withColumnRenamed("dst", "dstto")
final_join = src_join.join(dst_join, (src_join.srcto == dst_join.src) & (src_join.dst == dst_join.dstto), "inner").select("src", "dst")
g2 = GraphFrame(filtered_nodes, final_join)
Method 2: Using collected collection as a list-reference for isin-method (I'd only use it on small collections of filter nodes)
edgesframe = graphframe.edges
collected_nodes = filtered_nodes.select("columnWeUseForReference").rdd.map(lambda r: r[0]).collect()
edgs = edgesframe.filter(edgesframe.src.isin(collected_nodes) & edgesframe.dst.isin(collected_nodes))
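Independent of Spark, the semantics both methods aim for can be written down in a few lines of plain Python: keep an edge only if both of its endpoints are in the filtered vertex set. This is a sketch of the logic only, with made-up vertex ids and a hypothetical function name, not GraphFrames code.

```python
def induced_edges(vertex_ids, edges):
    """Keep only the (src, dst) edges whose endpoints are both in vertex_ids."""
    keep = set(vertex_ids)  # set membership makes each lookup O(1)
    return [(src, dst) for src, dst in edges if src in keep and dst in keep]


# Made-up example data.
all_edges = [("a", "b"), ("b", "c"), ("c", "b"), ("e", "f"),
             ("e", "d"), ("d", "a"), ("a", "e")]
filtered_ids = ["a", "b", "e", "f"]
kept = induced_edges(filtered_ids, all_edges)
```

Method 2 above is the distributed counterpart of this filter; it collects the ids to the driver first, which is why it only suits small vertex sets.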
Does someone have a better approach? I'd be really happy to see it.
I recommend using dropIsolatedVertices().
I'm working on some code for a directed graph in NetworkX, and have hit a block that's likely the result of my questionable programming experience. What I'm trying to do is the following:
I have a directed graph G, with two "parent nodes" at the top, from which all other nodes flow. When graphing this network, I'd like to graph every node that is a descendant of "Parent 1" one color, and all the other nodes another color. Which means I need a list of Parent 1's successors.
Right now, I can get the first layer of them easily using:
descend= G.successors(parent1)
The problem is this only gives me the first generation of successors. Preferably, I want the successors of successors, the successors of the successors of the successors, etc. Arbitrarily, because it would be extremely useful to be able to run the analysis and make the graph without having to know exactly how many generations are in it.
Any idea how to approach this?
You don't need a list of descendants, you just want to color them. For that you just have to pick an algorithm that traverses the graph and use it to color the edges.
For example, you can do
from networkx.algorithms.traversal.depth_first_search import dfs_edges
G = DiGraph( ... )
for edge in dfs_edges(G, parent1):
color(edge)
See https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.traversal.depth_first_search.dfs_edges.html?highlight=traversal
If you want to get all the successor nodes, without passing through edges, another way could be:
import networkx as nx
G = DiGraph( ... )
successors = nx.nodes(nx.dfs_tree(G, your_node))
I noticed that if you call instead:
successors = list(nx.dfs_successors(G, your_node))
the nodes of the bottom level are somehow not included.
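The reason is visible in a small sketch: dfs_successors returns a dictionary whose keys are only the nodes that have successors in the search tree, and calling list() on a dictionary returns its keys. Leaf nodes appear only inside the values. The example graph is made up.

```python
import networkx as nx

G = nx.DiGraph([(1, 2), (1, 3), (2, 4)])  # made-up graph; 3 and 4 are leaves

succ = nx.dfs_successors(G, 1)   # dict: node -> list of its DFS successors
keys_only = list(succ)           # only nodes that have successors (includes the start)
flattened = [n for children in succ.values() for n in children]
descendants = nx.descendants(G, 1)  # set of all nodes reachable from 1
```

Flattening the values, or calling nx.descendants directly, gives every node below the start node, including the bottom level.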
Well, the successor of successor is just the successor of the descendants right?
# First successors
descend = G.successors(parent1)
# 2nd level successors
def allDescendants(d1):
    d2 = []
    for d in d1:
        d2 += G.successors(d)
    return d2
descend2 = allDescendants(descend)
To get level 3 descendants, call allDescendants(descend2) etc.
Edit:
Issue 1:
allDescend = descend + descend2 gives you the two sets combined, do the same for further levels of descendants.
Issue2: If you have loops in your graph, then you need to first modify the code to test if you've visited that descendant before, e.g:
def allDescendants(d1, exclude):
    d2 = []
    for d in d1:
        d2 += filter(lambda s: s not in exclude, G.successors(d))
    return d2
This way, you pass allDescend as the second argument to the above function so that nodes already found are not included in future descendants. You keep doing this until allDescendants() returns an empty list, in which case you know you've explored the entire graph, and you stop.
Since this is starting to look like homework, I'll let you figure out how to piece all this together on your own. ;)
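For later readers, the pieces above can be assembled roughly as follows. This is a sketch under the assumption that the level-by-level expansion with an exclusion set is what is intended; the function name all_descendants and the adjacency data are made up, and the successors argument can be any callable such as G.successors.

```python
from collections import deque


def all_descendants(successors, start):
    """Collect every node reachable from start, level by level, skipping visited nodes."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in successors(node):
            # The visited check is what keeps loops from running forever.
            if child not in seen and child != start:
                seen.add(child)
                queue.append(child)
    return seen


# Made-up adjacency with a loop 2 -> 4 -> 2; node 5 is an ancestor, not a descendant.
adjacency = {1: [2, 3], 2: [4], 3: [4], 4: [2], 5: [1]}
found = all_descendants(lambda n: adjacency.get(n, []), 1)
```

With a NetworkX graph the call would be all_descendants(G.successors, parent1).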
So that the answer is somewhat cleaner and easier to find for future folks who stumble upon it, here's the code I ended up using:
G = DiGraph() # Creates an empty directed graph G
infile = open(sys.argv[1])
for edge in infile:
    edge1, edge2 = edge.split() #Splits data on the space
    node1 = int(edge1) #Creates integer version of the node names
    node2 = int(edge2)
    G.add_edge(node1,node2) #Adds an edge between two nodes
parent1=int(sys.argv[2])
parent2=int(sys.argv[3])
data_successors = dfs_successors(G,parent1)
successor_list = data_successors.values()
allsuccessors = [item for sublist in successor_list for item in sublist]
pos = graphviz_layout(G,prog='dot')
plt.figure(dpi=300)
draw_networkx_nodes(G,pos,node_color="LightCoral")
draw_networkx_nodes(G,pos,nodelist=allsuccessors, node_color="SkyBlue")
draw_networkx_edges(G,pos,arrows=False)
draw_networkx_labels(G,pos,font_size=6,font_family='sans-serif',labels=labels)
I believe NetworkX has changed since @Jochen Ritzel's answer a few years ago.
Now the following holds, only changing the import statement.
import networkx
from networkx import dfs_edges
G = DiGraph( ... )
for edge in dfs_edges(G, parent1):
color(edge)
Oneliner:
descendents = sum(nx.dfs_successors(G, parent).values(), [])
nx.descendants(G, parent)
more details: https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.dag.descendants.html
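Applied to the original coloring question, nx.descendants can feed a per-node color list in a single drawing call. The graph below is made up, and the color names are the ones used earlier in the thread.

```python
import networkx as nx

# Made-up graph with two parents; "b" is reachable from both.
G = nx.DiGraph([("P1", "a"), ("a", "b"), ("P2", "c"), ("c", "b"), ("P2", "d")])

under_p1 = nx.descendants(G, "P1")  # set of nodes reachable from P1 (P1 itself excluded)
node_colors = ["SkyBlue" if n in under_p1 else "LightCoral" for n in G.nodes()]
color_of = dict(zip(G.nodes(), node_colors))

# To draw: nx.draw_networkx(G, node_color=node_colors)
```

The color list follows the order of G.nodes(), which is the order draw_networkx uses when no nodelist is given.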
I'm trying to design a project that takes global positioning data, like city and state names along with latitudes and locations. I'll also have distances between every pair of cities. I want to make a graph with all of this information, and manipulate it to perform some graph algorithms. I've decided to have city objects which contains each location's data. Now should I have a hash function to differentiate objects? And how should I handle graph algorithms that combine nodes and remove edges?
def minCut(self):
    """Returns the lowest-cost set of edges that will disconnect a graph"""
    smcut = (float('infinity'), None)
    cities = self.__selectedcities[:]
    edges = self.__selectededges[:]
    g = self.__makeGRAPH(cities, edges)
    if not nx.is_connected(g):
        print("The graph is already disconnected!")
        return
    while len(g.nodes()) >1:
        stphasecut = self.mincutphase(g)
        # compare against the stored weight, not the whole tuple
        if stphasecut[2] < smcut[0]:
            smcut = (stphasecut[2], None)
        self.__merge(g, stphasecut[0], stphasecut[1])
    # the weight is stored in the first slot of smcut
    print("Weight of the min-cut: "+str(smcut[0]))
It's in really bad shape. I'm rewriting my original program, but this is the approach i took from the previous version.
Depending on what version of networkx you have installed, there is a built-in implementation of min_cut available.
I had the 1.0RC1 package installed and that was not available.. but I upgraded to 1.4 and min_cut is there.
Here's a (silly) example:
import networkx as nx
g = nx.DiGraph()
g.add_nodes_from(['London', 'Boston', 'NY', 'Dallas'])
g.add_edge('NY', 'Boston', capacity=1)
g.add_edge('Dallas', 'Boston')
g.add_edge('Dallas', 'London')
# add capacity to existing edge
g.edge['Dallas']['London']['capacity'] = 2
# create edge with capacity attribute
g.add_edge('NY', 'London', capacity=3)
print nx.min_cut(g, 'NY', 'London')
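In NetworkX 2.x the function is called minimum_cut, and it returns both the cut value and the two sides of the partition. The question's phase-and-merge loop also looks like the Stoer-Wagner scheme for a global minimum cut, for which NetworkX provides stoer_wagner on undirected weighted graphs. The sketch below uses made-up capacities and weights.

```python
import networkx as nx

# s-t minimum cut on a directed graph; capacities are made up.
g = nx.DiGraph()
g.add_edge("NY", "London", capacity=3)
g.add_edge("NY", "Boston", capacity=1)
g.add_edge("Boston", "London", capacity=2)
cut_value, (reachable, non_reachable) = nx.minimum_cut(g, "NY", "London")

# Global minimum cut on an undirected graph with the same made-up weights.
ug = nx.Graph()
ug.add_weighted_edges_from([("NY", "London", 3), ("NY", "Boston", 1), ("Boston", "London", 2)])
global_value, (side_a, side_b) = nx.stoer_wagner(ug)
```

An edge without a capacity attribute is treated by minimum_cut as having infinite capacity, so every edge that should be cuttable needs one.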
You don't need to create a hash function for the city objects, you can pass the city object directly to Networkx - from the tutorial "nodes can be any hashable object e.g. a text string, an image, an XML object, another Graph, a customized node object, etc."
You can iterate over the list of cities and add them as nodes and then iterate the distance information to make a graph.
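A minimal sketch of that idea, assuming a City class of your own: a frozen dataclass gets value-based equality and hashing for free, so two objects built from the same data refer to the same node. The class layout, the approximate coordinates and the distance value are made up for illustration.

```python
from dataclasses import dataclass

import networkx as nx


@dataclass(frozen=True)  # frozen=True makes instances immutable and hashable
class City:
    name: str
    state: str
    lat: float
    lon: float


# Illustrative values only.
boston = City("Boston", "MA", 42.36, -71.06)
dallas = City("Dallas", "TX", 32.78, -96.80)

g = nx.Graph()
g.add_edge(boston, dallas, weight=2490.0)  # distance stored as an edge attribute

# An equal City built separately finds the same node.
same_boston = City("Boston", "MA", 42.36, -71.06)
```

A plain class without __eq__ and __hash__ also works as a node, but it is then compared by identity, so the same object must be reused everywhere.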
Have you looked at the tutorial? http://networkx.lanl.gov/tutorial/tutorial.html
As you merge nodes, you can create new nodes and map these new nodes to the list of cities that have been merged. For example, in the above code you can name the new node len(g) and map it to stphasecut[0] + stphasecut[1] (assuming stphasecut[0] and stphasecut[1] are lists).