Networkx Array of Business Connections - python

I am trying to create a networkx graph mapping the business connections in our database. In other words, I would like every id (i.e. each individual business) to be a node, and I would like a line connecting the nodes that are 'connected'. A business is considered connected to another if the lead_id and connection_id are associated together, as in the data structure below.
lead_id connection_id
56340 1
56340 2
58684 3
58696 4
58947 5
Every example I find in the networkx documentation uses the following:
G=nx.random_geometric_graph(200,0.125)
pos=nx.get_node_attributes(G,'pos')
I am trying to determine how to incorporate my values into this.

Here is a way to create a graph from the data presented:
import networkx as nx

G = nx.Graph()
for lead, connection in zip(data.lead_id, data.connection_id):
    G.add_edge(lead, connection)
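If you want a self-contained example to experiment with, here is a minimal sketch; the data DataFrame below is hypothetical, built from the sample rows in the question, and the layout/draw calls are just one way to visualise the result:

import networkx as nx
import pandas as pd

# Hypothetical DataFrame mirroring the sample rows in the question
data = pd.DataFrame({
    'lead_id': [56340, 56340, 58684, 58696, 58947],
    'connection_id': [1, 2, 3, 4, 5],
})

# One node per id, one edge per lead/connection pair
G = nx.Graph()
G.add_edges_from(zip(data.lead_id, data.connection_id))

# Compute a layout and draw the network
pos = nx.spring_layout(G)
nx.draw(G, pos, with_labels=True)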

Related

How to construct a graph from a house price prediction dataset

I have a dataset of house price predictions.
house_id  society_id  building_type  households  yyyymmdd    floor  price  date
204       a9cvzgJ     170            185         01/02/2006  3      43000  01/02/2006
100       a4Nkquj     170            150         01/04/2006  13     46300  01/04/2006
The dataset has shape (2000, 40), and 1880 rows share the same house id.
I have to make heterogeneous graphs from the dataset. The metapaths are as follows:
Here BT stands for building type, and H1 and H2 represent house 1 and house 2.
The meta-graph example is:
I know of NetworkX; it has a dataframe-to-graph function, but I don't know how to use it in my scenario. The price column is the target node.
A glimpse of the dataset is shown above.
Any guidance will mean a lot, thank you. The goal is to make an adjacency matrix of the dataset.
To build a graph like M_1 using only one attribute (such as building type), you could do either of the following. You could use from_pandas_edgelist:
G = nx.from_pandas_edgelist(df, source='house_id', target='building_id')
or you could do the following:
G = nx.Graph()
G.add_edges_from(df.loc[:,['house_id','building_id']].to_numpy())
If you have a list of graphs glist = [M_1, M_2, ...], each of which connects house_id to one other attribute, you can combine them using the compose_all function. For instance,
G = nx.compose_all(glist)
Alternatively, if you have an existing graph made using certain attributes, you can add another attribute with
G.add_edges_from(df.loc[:,['house_id','new_attribute']].to_numpy())
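Putting those pieces together, here is a rough, self-contained sketch that also produces the adjacency matrix the question asks for; the tiny df and the choice of attributes are assumptions based on the two rows shown in the question:

import networkx as nx
import pandas as pd

# Tiny DataFrame mirroring the two rows shown in the question
df = pd.DataFrame({
    'house_id': [204, 100],
    'society_id': ['a9cvzgJ', 'a4Nkquj'],
    'building_type': [170, 170],
})

# One graph per metapath: house_id connected to a single attribute
glist = [
    nx.from_pandas_edgelist(df, source='house_id', target=attr)
    for attr in ['building_type', 'society_id']
]

# Merge the per-attribute graphs into one heterogeneous graph
G = nx.compose_all(glist)

# Adjacency matrix of the combined graph as a DataFrame
adj = nx.to_pandas_adjacency(G)
print(adj)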

Is the DataFrame ok for representing a graph?

I want to represent relationships between nodes in Python using pandas.DataFrame.
Each relationship has a weight, so I used a DataFrame like this.
      nodeA  nodeB  nodeC
nodeA     0      5      1
nodeB     5      0      4
nodeC     1      4      0
But I think this is an improper way to express relationships, because the DataFrame is symmetric and has duplicated data.
Is there a more proper way than using a DataFrame to represent a graph in Python?
(Sorry for my bad English)
This seems like an acceptable way to represent a graph, and is in fact compatible with, say, networkx. For example, you can recover a networkx graph object as follows:
import networkx as nx
g = nx.from_pandas_adjacency(df)
print(g.edges)
# [('nodeA', 'nodeB'), ('nodeA', 'nodeC'), ('nodeB', 'nodeC')]
print(g.get_edge_data('nodeA', 'nodeB'))
# {'weight': 5}
If your graph is sparse, you may want to store it as an edge list instead, e.g. as discussed here.
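For reference, here is a small sketch of the edge-list form; it rebuilds the matrix from the question and converts it with to_pandas_edgelist, which stores each undirected edge only once:

import networkx as nx
import pandas as pd

# Recreate the adjacency matrix from the question
df = pd.DataFrame(
    [[0, 5, 1], [5, 0, 4], [1, 4, 0]],
    index=['nodeA', 'nodeB', 'nodeC'],
    columns=['nodeA', 'nodeB', 'nodeC'],
)

g = nx.from_pandas_adjacency(df)

# Edge-list form: one row per edge, no duplicated symmetric entries
edges = nx.to_pandas_edgelist(g)
print(edges)
#   source target  weight
# 0  nodeA  nodeB       5
# 1  nodeA  nodeC       1
# 2  nodeB  nodeC       4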

create nested groups of nodes in networkx

I am trying to use networkx to create groups within groups, etc., of nodes.
For instance, I have nodes [1,2,3,4,5,6], currently with no sub-groups. I want it to end up like this:
[1,2,[[3,4],[5,6]]]
Currently I am just doing a for loop with some data and adding the nodes to the graph like this.
self.G = nx.Graph()
for s in self.tabledata:
    self.G.add_node(s[0])
    self.G.__dict__['_node'][s[0]]['label'] = '{0}'.format(s[0])
nx.write_graphml(self.G, '/filename')
where self.tabledata contains the following values [1234,2345,3456,4567,5678,6789]
I want to move 3 and 4 to be in a group 'A' together, 5 and 6 to be in a group 'B' together, and groups A and B to be in a group 'C' along with nodes 1 and 2.
So as far as groups are concerned, you have this: [C[A,B]].
Any ideas how this can be accomplished?
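NetworkX itself has no built-in notion of nested groups, so as a hedged sketch of one common workaround you could encode the hierarchy as a node attribute and let a downstream tool reconstruct the grouping from it; the path-style 'group' strings and the file name below are purely illustrative:

import networkx as nx

G = nx.Graph()

# Illustrative nesting: 3 and 4 in group A, 5 and 6 in group B,
# and A, B plus nodes 1 and 2 all inside the top-level group C
groups = {1: 'C', 2: 'C', 3: 'C/A', 4: 'C/A', 5: 'C/B', 6: 'C/B'}

for node, group in groups.items():
    G.add_node(node, label=str(node), group=group)

# Node attributes (including the group path) are written out to GraphML,
# so a downstream tool can rebuild the hierarchy from the 'group' field
nx.write_graphml(G, 'nested_groups.graphml')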

How to output attributes of nodes of a Graph (NetworkX) into a DataFrame (Pandas)

I am working with networks as graphs of the interaction between characters in Spanish theatre plays. Here is a visualisation:
I passed several attributes of the nodes (characters) as a dataframe to the network, so that I can use these values (for example, the color of the nodes is set by the gender of the character). I want to calculate with NetworkX different values for each node (degree, centrality, betweenness...); then, I would like to output as a DataFrame both my attributes of the nodes and also the values calculated with NetworkX. I know I can ask for specific attributes of the nodes, like:
nx.get_node_attributes(graph,'degree')
I could build a DataFrame using that, but I wonder if there is a more elegant solution. I have also tried:
nx.to_dict_of_dicts(graph)
But this outputs only the edges and not the information about the nodes.
So, any help, please? Thanks!
If I understand your question correctly, you want a DataFrame which has the nodes and some of the attributes of each node.
import networkx as nx
import pandas as pd

G = nx.Graph()
G.add_node(1, x=11, y=111)
G.add_node(2, x=22, y=222)
You can use a list comprehension as follows to get specific node attributes:
pd.DataFrame([[i[0], i[1]['x'], i[1]['y']] for i in G.nodes(data=True)]
, columns=['node_name', 'x', 'y']).set_index('node_name')
#             x    y
# node_name
# 1          11  111
# 2          22  222
or if there are many attributes and you need all of them, this could be a better solution:
pd.DataFrame([i[1] for i in G.nodes(data=True)], index=[i[0] for i in G.nodes(data=True)])
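If you also want the values computed by NetworkX (degree, betweenness, and so on) in the same DataFrame, one possible extension of the snippet above is to join the computed measures as extra columns; the particular metrics below are just examples:

import networkx as nx
import pandas as pd

# Node attributes already stored on the graph (continuing from G above)
nodes = pd.DataFrame(
    [d for _, d in G.nodes(data=True)],
    index=[n for n, _ in G.nodes(data=True)],
)

# Measures computed by NetworkX, added as extra columns
nodes['degree'] = pd.Series(dict(G.degree()))
nodes['betweenness'] = pd.Series(nx.betweenness_centrality(G))
nodes['closeness'] = pd.Series(nx.closeness_centrality(G))

print(nodes)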

What is an efficient way of creating a "network" from identifier data in pandas

I am a newbie in Python and, after browsing several answers to various questions concerning loops in python/pandas, I remain confused about how to solve my problem concerning water management data. I am trying to categorise and aggregate data based on its position in the sequence of connected nodes. The "network" is formed by each node containing the ID of the node that is downstream.
The original data contains roughly 53,000 items, which I converted to a pandas DataFrame; it looks something like this:
subwatershedsID = pd.DataFrame({ 'ID' : ['649208-127140','649252-127305','650556-126105','687315-128898'],'ID_DOWN' : ['582500-113890','649208-127140','649252-127305','574050-114780'], 'OUTLET_ID' : ['582500-113890','582500-113890','582500-113890','574050-114780'], 'CATCH_ID' : [217,217,217,213] })
My naive approach to deal with the data closest to the coast illustrates what I am trying to achieve.
sbwtrshdNextToStretch = subwatershedsID.loc[subwatershedsID['ID_DOWN'] == subwatershedsID['OUTLET_ID']]
sbwtrshdNextToStretchID = sbwtrshdNextToStretch[['ID']]
sbwtrshdStepFurther = pd.merge(sbwtrshdNextToStretchID, subwatershedsID, how='inner', left_on='ID', right_on='ID_DOWN')
sbwtrshdStepFurther.rename(columns={'ID_y': 'ID'}, inplace=True)
sbwtrshdStepFurtherID = sbwtrshdStepFurther[['ID']]
sbwtrshdTwoStepsFurther = pd.merge(sbwtrshdStepFurtherID, subwatershedsID, how='inner', left_on='ID', right_on='ID_DOWN')
sbwtrshdTwoStepsFurther.rename(columns={'ID_y': 'ID'}, inplace=True)
sbwtrshdTwoStepsFurtherID = sbwtrshdTwoStepsFurther[['ID']]
subwatershedsAll = [sbwtrshdNextToStretchID, sbwtrshdStepFurtherID, sbwtrshdTwoStepsFurtherID]
subwatershedWithDistances = pd.concat(subwatershedsAll, keys=['d0', 'd1', 'd2'])
So this gives each node an identifier for how many nodes away it is from the first one. It feels like there should be a simpler way to achieve this, and certainly something that works better for the whole dataset, which can have a large number of consecutive connections. My thoughts keep returning to writing a loop within a loop, but all the advice seems to recommend avoiding that, which also discourages me from learning how to write the loop correctly. Furthermore, the comments on poor loop performance leave me with further doubts, since I am not sure how fast solving for 53,000 rows would be. So what would be a good Python-style solution?
If I understand correctly you have two stages:
Categorise each node based on its position in the network
Perform calculations on the data to work out things like volumes of water, number of nodes a certain distance from the outlet, etc.
If so...
1) Use NetworkX to perform the calculations on relative position in the network
NetworkX is a great network analysis library that comes with ready-made methods to achieve this kind of thing.
Here's an example using dummy data:
import networkx as nx

G = nx.Graph()
G.add_nodes_from([1,2,3,4])
G.add_edges_from([(1,2),(2,3),(3,4)])
# In this example, the shortest path is all the way down the stream
nx.shortest_path(G,1,4)
> [1,2,3,4]
len(nx.shortest_path(G,1,4))
> 4
# I've shortened the path by adding a new 'edge' (connection) between 1 and 4
G.add_edges_from([(1,2),(2,3),(3,4),(1,4)])
# Result is a much shorter path of only two nodes - the source and target
nx.shortest_path(G,1,4)
> [1,4]
len(nx.shortest_path(G,1,4))
> 2
2) Annotate the dataframe for later calculations
Once you have this data in a network format, you can iterate through the data and add that as metadata to the DataFrame.
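As a rough sketch of that annotation step using the subwatershedsID frame from the question; the use of a directed graph, shortest_path_length, and the steps_to_outlet column name are my assumptions, not part of the original answer:

import networkx as nx
import pandas as pd

subwatershedsID = pd.DataFrame({
    'ID': ['649208-127140', '649252-127305', '650556-126105', '687315-128898'],
    'ID_DOWN': ['582500-113890', '649208-127140', '649252-127305', '574050-114780'],
    'OUTLET_ID': ['582500-113890', '582500-113890', '582500-113890', '574050-114780'],
    'CATCH_ID': [217, 217, 217, 213],
})

# Directed edges point downstream: from each subwatershed to the one below it
G = nx.from_pandas_edgelist(subwatershedsID, source='ID', target='ID_DOWN',
                            create_using=nx.DiGraph)

# Number of downstream steps from each subwatershed to its outlet
subwatershedsID['steps_to_outlet'] = [
    nx.shortest_path_length(G, row.ID, row.OUTLET_ID)
    for row in subwatershedsID.itertuples()
]
print(subwatershedsID)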
