Proper subgraphing of a PySpark GraphFrame

Proper subgraphing of a PySpark GraphFrame - python

graphframes is a network analysis tool based on PySpark DataFrames. The following code is a modified version of the tutorial subgraphing example:
from graphframes.examples import Graphs
import graphframes
g = Graphs(sqlContext).friends() # Get example graph
# Select subgraph of users older than 30
v2 = g.vertices.filter("age > 30")
g2 = graphframes.GraphFrame(v2, g.edges)
One would expect that the new graph, g2 will contain fewer nodes and fewer edges, compared to the original one, g.
However, this is not the case:
print(g.vertices.count(), g.edges.count())
print(g2.vertices.count(), g2.edges.count())
Gives the output:
(6, 7)
(7, 4)
It is obvious that the resulting graph contains edges for non-existing nodes.
Even more disturbing is the fact that g.degrees and g2.degrees are identical. This means that at least some of graph functionality ignores the nodes information. Is there a good way to make sure that GraphFrame creates
a graph using only the intersection of the supplied nodes and edges arguments?

A method that I use to subgraph a graphframe is using motifs:
motifs = g.find("(a)-[e]->(b)").filter(<conditions for a,b or e>)
new_vertices = sqlContext.createDataFrame(motifs.map(lambda row: row.a).union(motifs.map(lambda row: row.b)).distinct())
new_edges = sqlContext.createDataFrame(motifs.map(lambda row:row.e).distinct())
new_graph = GraphFrame(new_vertices,new_edges)
While this looks more complicated and possibly takes longer in terms of runtime, for more complicated graph queries, this serves well as you interact with the graphframe as a single entity rather than as vertices and edges being separate. So, filtering on vertices also influences edges left in the graphframe.

Interesting.. I'm not able to see that result:
>>> from graphframes.examples import Graphs
>>> import graphframes
>>> g = Graphs(sqlContext).friends() # Get example graph
>>> # Select subgraph of users older than 30
... v2 = g.vertices.filter("age > 30")
>>> g2 = graphframes.GraphFrame(v2, g.edges)
>>> print(g.vertices.count(), g.edges.count())
(6, 7)
>>> print(g2.vertices.count(), g2.edges.count())
(4, 7)
GraphFrames as of now does not check if the graph is valid - ie. all the edges are connects to vertices and so on, at graph construction time. But seems like the number of vertices is correct after the filter?

My work-arounds may not be the perfect ones, but they work for me.
Problem statement as I got it: having a filtered collection of nodes filtered_nodes, we only want to have the edges from the original graph that include nodes from filtered_nodes.
Method 1: Using joins (costly)
edgesframe = graphframe.edges
src_join = edgesframe.join(filtered_nodes, (edgesframe.src == subgraph_nodes.id), "inner").withColumnRenamed("src", "srcto")
dst_join = edgesframe.join(filtered_nodes, (edgesframe.dst == subgraph_nodes.id), "inner").withColumnRenamed("dst", "dstto")
final_join = src_join.join(dst_join, (src_join.src == dst_join.src) & (src_join.dst == dst_join.dst), "inner").select("src", "dst")
g2 = GraphFrame(filtered_nodes, final_join)
Method 2: Using collected collection as a list-reference for isin-method (I'd only use it on small collections of filter nodes)
edgesframe = graphframe.edges
collected_nodes = subgraph_nodes.select("columnWeUseForReference").rdd.map(lambda r: r[0]).collect()
edgs = edgesframe.filter(edgesframe.src.isin(collected_nodes) & edgesframe.dst.isin(collected_nodes))
Does someone have a better approach? I'd be really happy to see it.

I recommend using dropIsolatedVertices().

Related

how to create a network with node and edges from nested data structure?

I have a nested data structure, in which each element can be either iterable or not. I would like to build a graph which transforms this nested data structure in a network (I thought of the networkx package for that). Each element is a Tuple with (ID, value) in which value can be an integer or an Iterable.
My final graph should look something like this, in which each arrow is like an edge to all the indented elements (i.e. mainbox1 is connected to bigbox2, smallbox3, mediumbox4)
mainbox1 -->
bigbox2 -->
mediumbox5
smallbox6
smallbox3
mediumbox4 -->
smallbox7
I struggle to create an algorithm that does what I want. I thought it should be recursive (add each item until there is no more nesting) but I did not succeed in writing the implementation.
This was my starting point.
import networkx as nx
example = [('mainbox1',[('bigbox2', [('mediumbox5'),
('smallbox6')]),
('smallbox3'),
('mediumbox4', ('smallbox7'))
] )]

There is some problems with tuples in your example data. I made some corrections and this code works
import networkx as nx
def rec_make(g, root, nodes):
for node in nodes:
g.add_edge(root, node[0])
if isinstance(node[1], list):
rec_make(g, node[0], node[1])
def main():
g = nx.Graph()
example = [('mainbox1', [('bigbox2', [
('mediumbox5', 5),
('smallbox6', 6)
]), ('smallbox3', 3), ('mediumbox4', [
('smallbox7', 7)
])])]
rec_make(g, example[0][0], example[0][1])
print("Nodes in G: ", g.nodes())
print("Edges in G: ", g.edges())
Your are getting exactly what you want:
Nodes in G: ['mainbox1', 'bigbox2', 'mediumbox5', 'smallbox6', 'smallbox3', 'mediumbox4', 'smallbox7']
Edges in G: [('mainbox1', 'bigbox2'), ('mainbox1', 'smallbox3'), ('mainbox1', 'mediumbox4'), ('bigbox2', 'mediumbox5'), ('bigbox2', 'smallbox6'), ('mediumbox4', 'smallbox7')]

Drawing graph with labels in networkx obtained from a py2neo query

I am running some data analysis with a Jupyter notebook where I have a query with a variable length matching like this one:
MATCH p=(s:Skill)-[:BROADER*0..3]->(s)
WHERE s.label='py2neo' or s.label='Python'
RETURN p
I would like to plot its result as a graph, using networkx.
So far I have found two unsatisfactory solutions. Based on an notebook here, I can generate a graph using cypher magic whose result is directly understood by the networkx module.
result = %cypher MATCH p=(s:Skill)-[:BROADER*0..3]->(s) WHERE s.label='py2neo' or s.label='Python' RETURN p
nx.draw(result.get_graph())
However, then I am unable to find a way to add the labels to the plot.
That solution bypasses py2neo. With py2neo I can put labels on a graph, as long as I don't use a variable length pattern.
Example:
query='''MATCH p=(s1:Skill)-[:BROADER]->(s2)
WHERE s1.label='py2neo' or s1.label='Python'
RETURN s1.label as child, s2.label as parent'''
df = sgraph.data(query)
And then, copying from a response here in Stackoverflow (which I will link later) I can build the graph manually
G=nx.DiGraph()
G.add_nodes_from(list(set(list(df.iloc[:,0]) + list(df.iloc[:,1]))))
#Add edges
tuples = [tuple(x) for x in df.values]
G.add_edges_from(tuples)
G.number_of_edges()
#Perform Graph Drawing
#A star network (sort of)
nx.draw_networkx(G)
plt.show()
With this I get a graph with labels, but to get something like the variable length matching I should use multiple queries.
But how can I get the best of both worlds? I would prefer a py2neo solution. Rephrasing: How can I get py2neo to return a graph (not a table) and then be able to pass such information to networkx, being able to determine which, from the multiple possible labels, are the ones to be shown in the graph?

The question at the end was how can I get a table containing all the edges out of a subgraph that matches a certain query.
The Cypher that does the trick is:
MATCH (source:Skill)-[:BROADER*0..7]->(dest:Skill)
WHERE source.label_en in ['skill1','skill2']
WITH COLLECT(DISTINCT source)+COLLECT(dest) AS myNodes
UNWIND myNodes as myNode
MATCH p=(myNode)-[:BROADER]->(neighbor)
WHERE neighbor in myNodes
RETURN myNode.label_en as child ,neighbor.label_en as parent
The first two lines get the nodes belonging to said subgraph. The last five unwind it as pairs of nodes connected by a directed edge.
The 0 in the second MATCH allows for collecting isolated nodes that belong to the original list.
as in 2019, with current py2neopackages, a way that this thing would work is
query = '''
MATCH (source:Skill)-[:BROADER*0..7]->(dest:Skill)
WHERE source.label_en in ['skill1','skill2']
WITH COLLECT(DISTINCT source)+COLLECT(dest) AS myNodes
UNWIND myNodes as myNode
MATCH p=(myNode)-[:BROADER]->(neighbor)
WHERE neighbor in myNodes
RETURN myNode.label_en as child ,neighbor.label_en as parent
'''
df = pd.DataFrame(graph.run(query).data())
G=nx.DiGraph()
G.add_nodes_from(list(set(list(df['child']) + list(df.loc['parent']))))
#Add edges
tuples = [tuple(x) for x in df.values]
G.add_edges_from(tuples)
G.number_of_edges()
#Perform Graph Drawing
#A star network (sort of)
nx.draw_networkx(G)
plt.show()

Use NetworkX to find cycles in MultiDiGraph imported from shapefile

I am writing a QGIS plugin which will use the NetworkX library to manipulate and analyze stream networks. My data comes from shapefiles representing stream networks.
(arrows represent direction of stream flow)
Within this stream network are braids which are important features I need to retain. I am categorizing braid features into "simple" (two edges that share two nodes) and "complex" (more than two edges, with more than two nodes).
Simple braid example
Complex braid example
Normally, I would just use the NetworkX built-in function read_shp to import the shapefile as a DiGraph. As is evident in the examples, the "simple" braid will be considered a parallel edge in a NetworkX DiGraph, because those two edges (which share the same to and from nodes) would be collapsed into a single edge. In order to preserve these multiple edges, we wrote a function that imports a shapefile as a MultiDiGraph. Simple braids (i.e. parallel edges) are preserved by using unique keys in the edge objects (this is embedded in a class):
def _shp_to_nx(self, in_network_lyr, simplify=True, geom_attrs=True):
"""
This is a re-purposed version of read_shp from the NetworkX library.
:param shapelayer:
:param simplify:
:param geom_attrs:
:return:
"""
self.G = nx.MultiDiGraph()
for f in in_network_lyr.getFeatures():
flddata = f.attributes()
fields = [str(fi.name()) for fi in f.fields()]
geo = f.geometry()
# We don't care about M or Z
geo.geometry().dropMValue()
geo.geometry().dropZValue()
attributes = dict(zip(fields, flddata))
# Add a new _FID_ field
fid = int(f.id())
attributes[self.id_field] = fid
attributes['_calc_len_'] = geo.length()
# Note: Using layer level geometry type
if geo.wkbType() in (QgsWKBTypes.LineString, QgsWKBTypes.MultiLineString):
for edge in self.edges_from_line(geo, attributes, simplify, geom_attrs):
e1, e2, attr = edge
self.features[fid] = attr
self.G.add_edge(tuple(e1), tuple(e2), key=attr[self.id_field], attr_dict=attr)
self.cols = self.features[self.features.keys()[0]].keys()
else:
raise ImportError("GeometryType {} not supported. For now we only support LineString types.".
format(QgsWKBTypes.displayString(int(geo.wkbType()))))
I have already written a function to find the "simple" braid features (I just iterate through the MultiDiGraphs nodes, and find edges with more than one key). But I also need to find the "complex" braids. Normally, in a Graph, I could use the cycle_basis to find all of the "complex" braids (i.e. cycles), however, the cycle_basis method only works on un-directed Graphs, not directional graphs. But I'd rather not convert my MultiDiGraph into an un-directed Graph, as there can be unexpected results associated with that conversion (not to mention losing my edge key values).
How could I go about finding cycles which are made up of more than one edge, in a relatively time-efficient way? The stream networks I'm really working with can be quite large and complex, representing large watersheds.
Thanks!

So I came up with a solution, for finding both "simple" and "complex" braids.
def get_complex_braids(self, G, attrb_field, attrb_name):
"""
Create graph with the braid edges attributed
:param attrb_field: name of the attribute field
:return braid_G: graph with new attribute
"""
if nx.is_directed(G):
UG = nx.Graph(G)
braid_G = nx.MultiDiGraph()
for edge in G.edges(data=True, keys=True):
is_edge = self.get_edge_in_cycle(edge, UG)
if is_edge == True:
braid_G.add_edge(*edge)
self.update_attribute(braid_G, attrb_field, attrb_name)
return braid_G
else:
print "ERROR: Graph is not directed."
braid_complex_G = nx.null_graph()
return braid_complex_G
def get_simple_braids(self, G, attrb_field, attrb_name):
"""
Create graph with the simple braid edges attributed
:param attrb_field: name of the attribute field
:return braid_G: graph with new attribute
"""
braid_simple_G = nx.MultiDiGraph()
parallel_edges = []
for e in G.edges_iter():
keys = G.get_edge_data(*e).keys()
if keys not in parallel_edges:
if len(keys) == 2:
for k in keys:
data = G.get_edge_data(*e, key=k)
braid_simple_G.add_edge(e[0], e[1], key=k, attr_dict=data)
parallel_edges.append(keys)
self.update_attribute(braid_simple_G, attrb_field, attrb_name)
return braid_simple_G

This is not a definite answer, but longer than maximum allowed characters for a comment, so I post it here anyway.
To find simple braids, you can use built-in methods G.selfloop_edges and G.nodes_with_selfloops.
I haven't heard about cycle_basis for directed graphs, can you provide a reference (e.g. scientific work)? NetworkX has simple_cycles(G) which works on directed Graphs, but it is also not useful in this case, because water does not visit any node twice (or?).
I am afraid that the only way is to precisely describe the topology and then search the graph to find matching occurrences. let me clarify my point with an example. the following function should be able to identify instances of complex braids similar to your example:
def Complex_braid(G):
res = []
# find all nodes with out_degree greater than one:
candidates = [n for n in G.nodes() if len(G.successors(n)) > 1]
# find successors:
for n in candidates:
succ = G.successors(n)
for s in succ:
if len(list(nx.all_simple_paths(G,n,s))) > 1:
all_nodes = sorted(list(nx.all_simple_paths(G,n,s)), key=len)[-1]
res.append(all_nodes)
return res
G = nx.MultiDiGraph()
G.add_edges_from([(0,1), (1,2), (2,3), (4,5), (1,5), (5,2)])
Complex_braid(G)
# out: [[1, 5, 2]]
but the problem actually is that complex braids can be in different topological configurations and therefore it doesn't really make sense to define all possible topological configurations, unless you can describe them with one (or few) patterns or you can find a condition that signify the presence of complex braid.

How to represent networkx graphs with edge weight using nxpd like outptut

Recently I asked the question How to represent graphs with ipython. The answer was exactly what i was looking for, but today i'm looking for a way to show the edge valuation on the final picture.
The edge valuation is added like this :
import networkx as nx
from nxpd import draw # If another library do the same or nearly the same
# output of nxpd and answer to the question, that's
# not an issue
import random
G = nx.Graph()
G.add_nodes_from([1,2])
G.add_edge(1, 2, weight=random.randint(1, 10))
draw(G, show='ipynb')
And the result is here.
I read the help of nxpd.draw (didn't see any web documentation), but i didn't find anything.
Is there a way to print the edge value ?
EDIT : also, if there's a way to give a formating function, this could be good. For example :
def edge_formater(graph, edge):
return "My edge %s" % graph.get_edge_value(edge[0], edge[1], "weight")
EDIT2 : If there's another library than nxpd doing nearly the same output, it's not an issue
EDIT3 : has to work with nx.{Graph|DiGraph|MultiGraph|MultiDiGraph}

If you look at the source nxpd.draw (function draw_pydot) calls to_pydot which filters the graph attributes like:
if attr_type == 'edge':
accepted = pydot.EDGE_ATTRIBUTES
elif attr_type == 'graph':
accepted = pydot.GRAPH_ATTRIBUTES
elif attr_type == 'node':
accepted = pydot.NODE_ATTRIBUTES
else:
raise Exception("Invalid attr_type.")
d = dict( [(k,v) for (k,v) in attrs.items() if k in accepted] )
If you look up pydot you find the pydot.EDGE_ATTRIBUTES which contains valid Graphviz-attributes. The weight strictly refers to edge-weight by Graphviz if I recall, and label is probably the attribute you need. Try:
G = nx.Graph()
G.add_nodes_from([1,2])
weight=random.randint(1, 10)
G.add_edge(1, 2, weight=weight, label=str(weight))
Note that I haven't been able to test if this works, just downvote if it doesn't.

Finding Successors of Successors in a Directed Graph in NetworkX

I'm working on some code for a directed graph in NetworkX, and have hit a block that's likely the result of my questionable programming experience. What I'm trying to do is the following:
I have a directed graph G, with two "parent nodes" at the top, from which all other nodes flow. When graphing this network, I'd like to graph every node that is a descendant of "Parent 1" one color, and all the other nodes another color. Which means I need a list Parent 1's successors.
Right now, I can get the first layer of them easily using:
descend= G.successors(parent1)
The problem is this only gives me the first generation of successors. Preferably, I want the successors of successors, the successors of the successors of the successors, etc. Arbitrarily, because it would be extremely useful to be able to run the analysis and make the graph without having to know exactly how many generations are in it.
Any idea how to approach this?

You don't need a list of descendents, you just want to color them. For that you just have to pick a algorithm that traverses the graph and use it to color the edges.
For example, you can do
from networkx.algorithms.traversal.depth_first_search import dfs_edges
G = DiGraph( ... )
for edge in dfs_edges(G, parent1):
color(edge)
See https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.traversal.depth_first_search.dfs_edges.html?highlight=traversal

If you want to get all the successor nodes, without passing through edges, another way could be:
import networkx as nx
G = DiGraph( ... )
successors = nx.nodes(nx.dfs_tree(G, your_node))
I noticed that if you call instead:
successors = list(nx.dfs_successors(G, your_node)
the nodes of the bottom level are somehow not included.

Well, the successor of successor is just the successor of the descendants right?
# First successors
descend = G.successors(parent1)
# 2nd level successors
def allDescendants(d1):
d2 = []
for d in d1:
d2 += G.successors(d)
return d2
descend2 = allDescendants(descend)
To get level 3 descendants, call allDescendants(d2) etc.
Edit:
Issue 1:
allDescend = descend + descend2 gives you the two sets combined, do the same for further levels of descendants.
Issue2: If you have loops in your graph, then you need to first modify the code to test if you've visited that descendant before, e.g:
def allDescendants(d1, exclude):
d2 = []
for d in d1:
d2 += filter(lambda s: s not in exclude, G.successors(d))
return d2
This way, you pass allDescend as the second argument to the above function so it's not included in future descendants. You keep doing this until allDescandants() returns an empty array in which case you know you've explored the entire graph, and you stop.
Since this is starting to look like homework, I'll let you figure out how to piece all this together on your own. ;)

So that the answer is somewhat cleaner and easier to find for future folks who stumble upon it, here's the code I ended up using:
G = DiGraph() # Creates an empty directed graph G
infile = open(sys.argv[1])
for edge in infile:
edge1, edge2 = edge.split() #Splits data on the space
node1 = int(edge1) #Creates integer version of the node names
node2 = int(edge2)
G.add_edge(node1,node2) #Adds an edge between two nodes
parent1=int(sys.argv[2])
parent2=int(sys.argv[3])
data_successors = dfs_successors(G,parent1)
successor_list = data_successors.values()
allsuccessors = [item for sublist in successor_list for item in sublist]
pos = graphviz_layout(G,prog='dot')
plt.figure(dpi=300)
draw_networkx_nodes(G,pos,node_color="LightCoral")
draw_networkx_nodes(G,pos,nodelist=allsuccessors, node_color="SkyBlue")
draw_networkx_edges(G,pos,arrows=False)
draw_networkx_labels(G,pos,font_size=6,font_family='sans-serif',labels=labels)

I believe Networkx has changed since #Jochen Ritzel 's answer a few years ago.
Now the following holds, only changing the import statement.
import networkx
from networkx import dfs_edges
G = DiGraph( ... )
for edge in dfs_edges(G, parent1):
color(edge)

Oneliner:
descendents = sum(nx.dfs_successors(G, parent).values(), [])

nx.descendants(G, parent)
more details: https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.dag.descendants.html

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Proper subgraphing of a PySpark GraphFrame - python

I recommend using dropIsolatedVertices().

Related

how to create a network with node and edges from nested data structure?

Drawing graph with labels in networkx obtained from a py2neo query

Use NetworkX to find cycles in MultiDiGraph imported from shapefile

How to represent networkx graphs with edge weight using nxpd like outptut

Finding Successors of Successors in a Directed Graph in NetworkX

Categories

Resources