Merge nodes that share attributes - python

EDITED
I really need help from Networkx/graph experts.
Let us say I have the following data frames and I would like to convert these data frames to graphs. Then I would like to map the two graphs with corresponding nodes based on description and priority attributes.
df1
From  description  To      priority
10    Start        20, 50  1
20    Left         40      2
50    Bottom       40      2
40    End          -       1
df2
From  description  To      priority
60    Start        70,80   1
70    Left         80, 90  2
80    Left         100     2
90    Bottom       100     2
100   End          -       1
I converted the two data frames into two graphs, g1 and g2.
I am then trying to match nodes across the two graphs by description and priority, but each node should be matched only once. For example, 10/60, 40/100 and 50/90 are valid matches, but 20/70, 20/80 and 70/80 are not: node 20 has more than one candidate, which is not what I want. Nodes that cannot be matched uniquely should stay as single nodes and be marked red to set them apart.
To spell out what "matched only once" means: node 10 has priority 1 and description Start in the first graph, and the only node in the second graph with the same priority and description is 60, so 10 maps to 60. Node 20, however, has priority 2 and description Left, and the second graph has two such nodes, 70 and 80. That is ambiguous: I cannot map 20 twice, as 20/70 and 20/80. Instead I would like to keep such nodes as single nodes, as in the sample graph below.
I am expecting the following result.
To get the above result, I tried it with the following python code.
mapped_list = []
for node_1, data_1 in g1.nodes(data=True):
    for node_2, data_2 in g2.nodes(data=True):
        if (data_1['priority'] == data_2['priority']
                and data_1['description'] == data_2['description']):
            # check whether the nodes already exist in mapped_list
            if (node_1 in mapped_list) and (node_2 in mapped_list):
                pass
            else:
                name = str(node_1) + '/' + str(node_2)
                mapped_list.append((data_1["priority"], data_1["description"], node_1, name))
                mapped_list.append((data_2["priority"], data_2["description"], node_2, name))
Can anyone help me to achieve the above result shown on the figure /graph/? Any help is appreciated.

The way I'd go about this instead is to build a new graph by taking the nx.union of both graphs, and then "combine" the nodes that share attributes using nx.contracted_nodes.
Let's start by creating both graphs from the dataframes:
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

# split the comma-separated 'To' column into one row per edge
df1 = df1.drop(columns='To').join(df1.To.str.replace(' ','').str.split(',').explode())
df2 = df2.drop(columns='To').join(df2.To.str.replace(' ','').str.split(',').explode())
# the last row ('End') has no target, so it is left out of the edge list
g1 = nx.from_pandas_edgelist(df1.iloc[:-1,[0,3]].astype(int),
                             source='From', target='To', create_using=nx.DiGraph)
g2 = nx.from_pandas_edgelist(df2.iloc[:-1,[0,3]].astype(int),
                             source='From', target='To', create_using=nx.DiGraph)
# attach description, priority and source graph to every node
df1_node_ix = df1.assign(graph='graph1').set_index('From').rename_axis('nodes')
nx.set_node_attributes(g1, values=df1_node_ix.description.to_dict(),
                       name='description')
nx.set_node_attributes(g1, values=df1_node_ix.priority.to_dict(),
                       name='priority')
nx.set_node_attributes(g1, values=df1_node_ix.graph.to_dict(),
                       name='graph')
df2_node_ix = df2.assign(graph='graph2').set_index('From').rename_axis('nodes')
nx.set_node_attributes(g2, values=df2_node_ix.description.to_dict(),
                       name='description')
nx.set_node_attributes(g2, values=df2_node_ix.priority.to_dict(),
                       name='priority')
nx.set_node_attributes(g2, values=df2_node_ix.graph.to_dict(),
                       name='graph')
Now by taking the nx.union of both graphs, we have:
g3 = nx.union(g1,g2)
from networkx.drawing.nx_agraph import graphviz_layout
plt.figure(figsize=(8,5))
pos=graphviz_layout(g3, prog='dot')
nx.draw(g3, pos=pos,
        with_labels=True,
        node_size=1500,
        node_color='red',
        arrowsize=20)
What we can do now is come up with a data structure that makes it easy to combine the pairs of nodes that share attributes. For that we can sort the nodes by their description. Sorting them lets us use itertools.groupby to group consecutive nodes with equal attributes, which we can then combine using nx.contracted_nodes, overwriting the previous graph each time. The nodes can be relabeled as specified in the question with nx.relabel_nodes:
from itertools import groupby
g3_node_view = g3.nodes(data=True)
# sort by description, then priority, so that groupby sees equal keys consecutively
sorted_by_descr = sorted(g3_node_view,
                         key=lambda x: (x[1]['description'], x[1]['priority']))
node_colors = dict()
colors = {'Bottom':'saddlebrown', 'Start':'lightblue',
          'Left':'green', 'End':'lightblue'}
all_graphs = {'graph1', 'graph2'}
for _, grouped_by_descr in groupby(sorted_by_descr,
                                   key=lambda x: x[1]['description']):
    for _, group in groupby(grouped_by_descr, key=lambda x: x[1]['priority']):
        grouped_nodes = list(group)
        nodes = [i[0] for i in grouped_nodes]
        graphs = {i[1]['graph'] for i in grouped_nodes}
        # check if there are two nodes that share attributes
        # and both belong to different graphs
        if len(nodes)==2 and graphs==all_graphs:
            # contract both nodes and update graph
            g3 = nx.contracted_nodes(g3, *nodes)
            # define new contracted node name and relabel
            new_node = '/'.join(map(str, nodes))
            g3 = nx.relabel_nodes(g3, {nodes[0]:new_node})
            node_colors[new_node] = colors[grouped_nodes[0][1]['description']]
        else:
            for node in nodes:
                node_colors[node] = 'red'
Which would give:
plt.figure(figsize=(10,7))
pos=graphviz_layout(g3, prog='dot')
nx.draw(g3, pos=pos,
        with_labels=True,
        node_size=2500,
        nodelist=list(node_colors.keys()),
        node_color=list(node_colors.values()),
        arrowsize=20)
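For readers without the dataframes or Graphviz at hand, the same idea can be sketched with plain networkx. This is only an illustration under assumptions: the graphs are typed in by hand from the tables in the question, a dictionary keyed by (description, priority) stands in for the sort-and-groupby step, and the names merged, groups, keep and drop are made up for this sketch.

```python
from collections import defaultdict

import networkx as nx

# Edges and attributes typed in by hand from df1 and df2 in the question.
g1 = nx.DiGraph([(10, 20), (10, 50), (20, 40), (50, 40)])
g2 = nx.DiGraph([(60, 70), (60, 80), (70, 80), (70, 90), (80, 100), (90, 100)])
attrs1 = {10: ("Start", 1), 20: ("Left", 2), 50: ("Bottom", 2), 40: ("End", 1)}
attrs2 = {60: ("Start", 1), 70: ("Left", 2), 80: ("Left", 2),
          90: ("Bottom", 2), 100: ("End", 1)}
for g, attrs, label in ((g1, attrs1, "graph1"), (g2, attrs2, "graph2")):
    for node, (description, priority) in attrs.items():
        g.nodes[node].update(description=description, priority=priority, graph=label)

merged = nx.union(g1, g2)

# Collect nodes that share both description and priority.
groups = defaultdict(list)
for node, data in merged.nodes(data=True):
    groups[(data["description"], data["priority"])].append(node)

for nodes in groups.values():
    graphs = {merged.nodes[n]["graph"] for n in nodes}
    # Merge only unambiguous pairs: exactly one node from each graph.
    if len(nodes) == 2 and graphs == {"graph1", "graph2"}:
        keep, drop = nodes
        merged = nx.contracted_nodes(merged, keep, drop, self_loops=False)
        merged = nx.relabel_nodes(merged, {keep: f"{keep}/{drop}"})

print(sorted(map(str, merged.nodes)))
```

Nodes 20, 70 and 80 stay unmerged here because three nodes share the (Left, 2) attributes, which is the ambiguous case described in the question.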

Related

np.genfromtxt loop and fill empty array in python

I have 50 different folders (from 0 to 50) with the data I want to plot named as data_A, data_B, data_C, data_D.
How do I iterate through the 50 folders, collect the data, apply a numerical operation and print the output (V) to a list?
The final goal would be to make a boxplot of (V) for each folder.
I hope this attempt at (pseudo)code helps explain my aim:
directory='/path_to_data/'
folders = [0..50]
for i in (len(folders)):
    A = (np.genfromtxt(directory/[i]/'data_A.dat')
    B = (np.genfromtxt(directory/[i]/'data_B.dat')
    C = (np.genfromtxt(directory/[i]/'data_C.dat')
    D = (np.genfromtxt(directory/[i]/'data_D.dat')
    V = (A+B+C+D)/4 # average of the data
    DATA=[V,..]
    NAMES=[folder$i,.. ]
done
Thanks!
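One way the loop above could be written in runnable Python is sketched below. It assumes each folder is named after its number and that the four files in a folder hold arrays of the same shape; the function name folder_averages and its parameters are made up for this sketch.

```python
from pathlib import Path

import numpy as np

FILE_NAMES = ("data_A.dat", "data_B.dat", "data_C.dat", "data_D.dat")


def folder_averages(directory, folder_names, file_names=FILE_NAMES):
    """Return (data, names): one averaged array and one label per folder."""
    data, names = [], []
    for folder in folder_names:
        # Load the four files of this folder.
        arrays = [np.genfromtxt(Path(directory) / str(folder) / name)
                  for name in file_names]
        # Element-wise average, i.e. (A + B + C + D) / 4.
        data.append(np.mean(arrays, axis=0))
        names.append(f"folder{folder}")
    return data, names
```

Calling folder_averages('/path_to_data/', range(51)) would cover folders 0 to 50. The returned list of arrays can then be handed to matplotlib's boxplot, which draws one box per array.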

Separate TAP and HOVER tool for Edges of hv.Graph. Edge description data is missing

I am trying to build an hv graph in which edges can be tapped separately from nodes. In my case, all the meaningful data is bound to the edges.
gNodes = hv.Nodes((nodes_data.x, nodes_data.y, nodes_data.nid, nodes_data.name),
                  vdims=['name'])
gGraph = hv.Graph(((edges_data.source, edges_data.target, edges_data.name),gNodes),vdims=['name'])
opts = dict(width=1200,height=800,xaxis=None,yaxis=None,bgcolor='black',show_grid=True)
gEdges = gGraph.edgepaths
tiles = gv.tile_sources.Wikipedia()
(tiles * gGraph.edgepaths * gGraph.nodes.opts(size=12)).opts(**opts)
If I use gGraph.edgepaths * gGraph.nodes, there is no edge information displayed by the Hover tool.
The inspection policy 'edges' for hv.Graph does not suit my task, because it does not allow selecting a single edge.
Where did the edge label information go in the edgepaths property? How can I add it back?
Thank you!
I created a separate dataframe for the links, grouped it by the unique link label, and inserted an empty row after each group (each edge has two rows, source and target), as in: Pandas: Inserting an empty row after every 2nd row in a data frame
# one all-NaN row with the same columns as the edge data
empty_row = pd.DataFrame([pd.Series(np.nan, index=edges_data.columns)])
# DataFrame.append was removed in pandas 2.0, so use pd.concat
insert_f = lambda d: pd.concat([d, empty_row], ignore_index=True)
edges_df = edges_test.groupby(by='name', group_keys=False).apply(insert_f).reset_index(drop=True)
Then I create hv.EdgePaths from the dataframe:
gPaths2= hv.EdgePaths(edges_df, kdims=['lon_conv_a','lat_conv_a'])
TAP and HOVER works fine for me.
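The row-insertion step can be tried in isolation with plain pandas. The sketch below uses a made-up two-edge table (the columns name, lon and lat are assumptions, not the real data) and appends an all-NaN row after each group via reindex, which is an alternative to the concat approach above.

```python
import pandas as pd

# Hypothetical edge table: two rows (source, target) per edge name.
edges = pd.DataFrame({
    "name": ["e1", "e1", "e2", "e2"],
    "lon": [0.0, 1.0, 1.0, 2.0],
    "lat": [0.0, 1.0, 1.0, 0.0],
})


def add_separator(group):
    """Return the group with one extra all-NaN row at the end."""
    group = group.reset_index(drop=True)
    # Reindexing to one label more than exists creates the NaN row.
    return group.reindex(range(len(group) + 1))


pieces = [add_separator(group) for _, group in edges.groupby("name", sort=False)]
paths = pd.concat(pieces, ignore_index=True)
print(paths)
```

The NaN rows act as breaks between the individual paths, so each edge ends up as its own segment.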

Converting a pandas nodes and edges list from node labels to node index

I have a tidy representation of a graph or network expressed as two separate csvs; one for nodes, one for edges with weights. I've read them from csv into pandas dataframes in Python 3.
I create some analogous dataframes using different methods here but use them for illustration of the problem.
import pandas as pd
# i have a nodes list
nodes = {'page': ['/', '/a', '/b']}
# the data is actually read in from csv
nodes = pd.DataFrame.from_dict(nodes)
nodes
Which returns the node list which has automatically been indexed by the default method (whatever that is; I read it varied between Python versions but that shouldn't impact the question).
  page
0    /
1   /a
2   /b
The edge list is:
# and an edges list which uses node label; source and destination
# need to convert into indexes from nodes
edges = {'source_node': ['/', '/a', '/b', '/a'],
         'destination_node': ['/b', '/b', '/', '/'],
         'weight': [5, 2, 10, 5]}
# the data is actually read in from csv
edges = pd.DataFrame.from_dict(edges)
edges
Which looks like:
  source_node destination_node  weight
0           /               /b       5
1          /a               /b       2
2          /b                /      10
3          /a                /       5
Here you can see the problem: the source and destination nodes are labels rather than the corresponding node indexes from the previous dataframe. I want an edge dataframe that holds the indices of the labelled nodes rather than their labels. I could do this upstream in the data pipeline but want to fix it here for convenience. There are about 22k nodes and 45k edges. I don't mind if the solution takes a few minutes to run.
I can get the information I'm after but can't assign it to a new pandas column in the edges dataframe.
I can get the indexes I want by looping but is there a better way to do this in pandas, can I vectorise the problem like in R?
for i in edges["source_node"]:
    print(nodes[nodes.page == i].index.values.astype(int)[0])
for i in edges["destination_node"]:
    print(nodes[nodes.page == i].index.values.astype(int)[0])
0
1
2
1
2
2
0
0
And how do I get this into my edges dataframe as two new columns, one called 'source' and one called 'destination'? What I want is:
  source_node destination_node  weight  source  destination
0           /               /b       5       0            2
1          /a               /b       2       1            2
2          /b                /      10       2            0
3          /a                /       5       1            0
The following raises an error and doesn't look right to begin with:
edges['source'] = for i in edges["source_node"]:
    nodes[nodes.page == i].index.values.astype(int)[0]
edges['destination'] = for i in edges["destination_node"]:
    nodes[nodes.page == i].index.values.astype(int)[0]
As I'm new to Python, I'd be interested in a "Pythonic" way of solving this, as well as a method which is simple to my newbie eyes.
You can use map and set_index:
nodelist = nodes.reset_index().set_index('page').squeeze()
Or, as @mammykins suggested, for real-world data where a page can appear more than once, keep only the first occurrence before mapping:
nodelist = nodelist.loc[~nodelist.index.duplicated(keep='first')]
edges['source'] = edges.source_node.map(nodelist)
edges['destination'] = edges.destination_node.map(nodelist)
print(edges)
Output:
  source_node destination_node  weight  source  destination
0           /               /b       5       0            2
1          /a               /b       2       1            2
2          /b                /      10       2            0
3          /a                /       5       1            0
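As a self-contained variant of the same idea, the lookup can also be built directly as a Series whose index holds the page labels and whose values hold the row indexes. The name label_to_index is made up for this sketch; the data is the toy example from the question.

```python
import pandas as pd

nodes = pd.DataFrame({"page": ["/", "/a", "/b"]})
edges = pd.DataFrame({
    "source_node": ["/", "/a", "/b", "/a"],
    "destination_node": ["/b", "/b", "/", "/"],
    "weight": [5, 2, 10, 5],
})

# Lookup Series: page label -> row index in `nodes`.
label_to_index = pd.Series(nodes.index, index=nodes["page"])

# Series.map looks each label up in the index of `label_to_index`.
edges["source"] = edges["source_node"].map(label_to_index)
edges["destination"] = edges["destination_node"].map(label_to_index)
print(edges)
```

Because map is vectorised, this avoids the explicit Python loop from the question; labels missing from nodes would come out as NaN rather than raising an error.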

Create Dataframe out of distances and id's between dataframes

I'll try to explain what I'm currently working with:
I have two dataframes: one for Gas Station A (165 stations), and other for Gas Station B (257 stations). They both share the same format:
id  Coor
1   (a1,b1)
2   (a2,b2)
Coor has tuples with the location coordinates. What I want to do is to add 3 columns to Dataframe A with nearest Competitor #1, #2 and #3 (from Gas Station B).
Currently I managed to get every distance from A to B (42405 distance measures), but in a list format:
distances=[]
for (u,v) in gasA['coor']:
    for (w,x) in gasB['coor']:
        distances.append(sp.distance.euclidean((u,v),(w,x)))
This lets me have the values I need, but I still need to match them with the ID from Gas Station A, and get the top 3. I have the suspicion working with lists is not the best approach here. Do you have any suggestions?
Edit: as suggested, first 5 rows are:
in GasA:
id coor
60712 (-333525363206695,-705191013427772)
60512 (-333539879388388, -705394161580837)
60085 (-333545609177068, -703168832659184)
60110 (-333601677229216, -705167284798638)
60078 (-333608898397271, -707213099595404)
in GasB:
id coor
70174 (-333427160000000,-705459060000000)
70223 (-333523030000000, -706705470000000)
70383 (-333549270000000, -705320990000000)
70162 (-333556960000000, -705384750000000)
70289 (-333565850000000, -705104360000000)
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np
import pandas as pd
Creating the data:
A = pd.DataFrame({'id': ['60712', '60512', '60085', '60110', '60078'],
                  'coor': [(-333525363206695, -705191013427772),
                           (-333539879388388, -705394161580837),
                           (-333545609177068, -703168832659184),
                           (-333601677229216, -705167284798638),
                           (-333608898397271, -707213099595404)]})
B = pd.DataFrame({'id': ['70174', '70223', '70383', '70162', '70289'],
                  'coor': [(-333427160000000, -705459060000000),
                           (-333523030000000, -706705470000000),
                           (-333549270000000, -705320990000000),
                           (-333556960000000, -705384750000000),
                           (-333565850000000, -705104360000000)]})
Calculating the distances:
res = euclidean_distances(list(A.coor), list(B.coor))
Selecting top 3 closest stations from B and appending to a column in A:
d = []
for i, id_ in enumerate(A.index):
    distances = np.argsort(res[i])[0:3]  # positions of the 3 closest stations in B
    distances = B.iloc[distances]['id'].values
    d.append(distances)
A = A.assign(dist=d)
Edit: result of running with the example:
                                   coor     id                   dist
0  (-333525363206695, -705191013427772)  60712  [70223, 70174, 70162]
1  (-333539879388388, -705394161580837)  60512  [70223, 70289, 70174]
2  (-333545609177068, -703168832659184)  60085  [70223, 70174, 70162]
3  (-333601677229216, -705167284798638)  60110  [70223, 70174, 70162]
4  (-333608898397271, -707213099595404)  60078  [70289, 70383, 70162]
Define a function that calculates the distances from one station in A to every station in B and returns the positions in B of the three smallest distances.
def get_nearest_three(row):
    (u,v) = row['Coor']
    # distance from this A station to every B station
    dist_list = gasB['Coor'].apply(lambda coor: sp.distance.euclidean(coor, (u,v)))
    # positions of the 3 stations in B with the smallest distances
    return list(np.argsort(dist_list.values))[0:3]
gasA['dists'] = gasA.apply(get_nearest_three, axis = 1)
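You can do something like this.
# stack the coordinate tuples into float arrays of shape (n, 2)
a = np.array(gasA.coor.tolist(), dtype=float)
b = np.array(gasB.coor.tolist(), dtype=float)
# c[i, j] is the Euclidean distance from station i in A to station j in B
c = np.sqrt(np.sum((a[:,None,:] - b[None,:,:])**2, axis=2))
We get NumPy arrays of the coordinates for both frames, broadcast a against b so that every combination of stations is represented, and then take the Euclidean distance.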
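To go from a full distance matrix to the three nearest competitor IDs per station, a broadcasting approach along these lines might look as follows. The small coordinates and IDs are invented for illustration, and the column names competitor_1 to competitor_3 are an assumption about the desired output.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in the same layout as the question.
gas_a = pd.DataFrame({"id": [1, 2], "coor": [(0, 0), (10, 0)]})
gas_b = pd.DataFrame({"id": [10, 20, 30, 40],
                      "coor": [(1, 0), (0, 2), (9, 0), (5, 5)]})

# Coordinate tuples -> float arrays of shape (n, 2).
a = np.array(gas_a["coor"].tolist(), dtype=float)
b = np.array(gas_b["coor"].tolist(), dtype=float)

# dist[i, j] = Euclidean distance between station i in A and station j in B.
dist = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2))

# Positions of the three smallest distances in each row, then their IDs.
nearest_pos = np.argsort(dist, axis=1)[:, :3]
nearest_ids = gas_b["id"].to_numpy()[nearest_pos]

for k in range(3):
    gas_a[f"competitor_{k + 1}"] = nearest_ids[:, k]
print(gas_a)
```

Converting to float first matters for coordinates as large as those in the question, since squaring them as 64-bit integers would overflow.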
Consider a cross join (matching every row by every row between both datasets) which should be manageable with your small sets, 165 X 257, then calculate the distance. Then, rank by distance and filter for top 3.
cj_df = pd.merge(gasA.assign(key=1), gasB.assign(key=1),
                 on="key", suffixes=['_A', '_B'])
cj_df['distance'] = cj_df.apply(lambda row: sp.distance.euclidean(row['Coor_A'],
                                                                  row['Coor_B']),
                                axis = 1)
# RANK BY DISTANCE
cj_df['rank'] = cj_df.groupby('id_A')['distance'].rank()
# FILTER FOR TOP 3
top3_df = cj_df[cj_df['rank'] <= 3].sort_values(['id_A', 'rank'])
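The cross-join idea can be tried end to end on a small made-up example. This sketch assumes pandas 1.2 or later, where merge accepts how="cross" and no helper key column is needed, and it uses math.dist from the standard library in place of scipy; the toy coordinates are invented.

```python
import math

import pandas as pd

# Hypothetical toy data in the same layout as the question.
gas_a = pd.DataFrame({"id": [1, 2], "coor": [(0, 0), (10, 0)]})
gas_b = pd.DataFrame({"id": [10, 20, 30, 40],
                      "coor": [(1, 0), (0, 2), (9, 0), (5, 5)]})

# Every station in A paired with every station in B.
cj = gas_a.merge(gas_b, how="cross", suffixes=("_A", "_B"))
cj["distance"] = [math.dist(p, q) for p, q in zip(cj["coor_A"], cj["coor_B"])]

# Rank competitors within each A station and keep the three closest.
cj["rank"] = cj.groupby("id_A")["distance"].rank(method="first")
top3 = cj[cj["rank"] <= 3].sort_values(["id_A", "rank"])
print(top3[["id_A", "id_B", "distance", "rank"]])
```

The result is in long format, one row per (station, competitor) pair, which can be pivoted into three columns if that layout is preferred.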

Pandas Dataframe: Accessing via composite index created by groupby operation

I want to calculate a group specific ratio gathered from two datasets.
The two Dataframes are read from a database with
leases = pd.read_sql_query(sql, connection)
sales = pd.read_sql_query(sql, connection)
one for real estate offered for sale, the other for rented objects.
Then I group both of them by their city and the category I'm interested in:
leasegroups = leases.groupby(['IDconjugate', "city"])
salegroups = sales.groupby(['IDconjugate', "city"])
Now I want to know the ratio between the cheapest rental object per category and city and the most expensively sold object to obtain a lower bound for possible return:
minlease = leasegroups['price'].min()
maxsale = salegroups['price'].max()
ratios = minlease*12/maxsale
I get an output like: Category - City: Ratio
But I cannot access the ratio object by city nor category. I tried creating a new dataframe with:
newframe = pd.DataFrame({"Minleases" : minlease,"Maxsales" : maxsale,"Ratios" : ratios})
newframe = newframe.loc[newframe['Ratios'].notnull()]
which gives me the correct rows, and newframe.index returns the groups.
newframe.index.names gives ['IDconjugate', 'city'], but indexing by city or category results in a KeyError. How can I build an index out of the different groups: ID0+city1, ID0+city2, etc.?
EDIT:
The output looks like this:
                               Maxsales  Minleases    Ratios
IDconjugate city
1           argeles gazost        59500        337  0.067966
            chelles              129000        519  0.048279
            enghien-les-bains    143000        696  0.058406
            esbly                117990        495  0.050343
            foix                  58000        350  0.072414
The goal was to select the top ratios and plot them with bokeh, which takes a
dataframe object and plots a column versus an index as I understand it:
topselect = ratio.loc[ratio["Ratios"] > ratio["Ratios"].quantile(quant)]
dots = Dot(topselect, values='Ratios', label=topselect.index, tools=[hover,],
           title="{}% best minimal Lease/Sale Ratios per City and Group".format(topperc*100), width=600)
I really only needed the index as a list in the original order, so the following worked:
ids = []
cities = []
for l in topselect.index:
    ids.append(str(int(l[0])))
    cities.append(l[1])
newind = [i+"_"+j for i,j in zip(ids, cities)]
topselect.index = newind
Now the plot shows 1_city1 ... 1_city2 ... n_cityX on the x-axis. But I figure there must be some obvious way inside the pandas framework that I'm missing.
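For reference, a few pandas-native ways of working with such a two-level index are sketched below. The small frame imitates the output above; the row for IDconjugate 2 is invented for illustration, and the variable names are made up.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(1, "chelles"), (1, "foix"), (2, "foix")],  # the (2, "foix") row is hypothetical
    names=["IDconjugate", "city"],
)
frame = pd.DataFrame({"Ratios": [0.048279, 0.072414, 0.05]}, index=index)

# Select one group with a tuple of (IDconjugate, city).
one_row = frame.loc[(1, "foix")]

# Select by a single level, here every category for one city.
by_city = frame.xs("foix", level="city")

# Flatten the index into "ID_city" labels, e.g. for plotting.
flat = frame.copy()
flat.index = [f"{int(i)}_{city}" for i, city in flat.index]

# Or turn both index levels back into ordinary columns.
as_columns = frame.reset_index()
print(flat)
```

A KeyError typically appears when a MultiIndex frame is indexed with a bare city name, since .loc matches the first level by default; a tuple or xs with an explicit level avoids that.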
