I want to represent relationships between nodes in Python using pandas.DataFrame.
Each relationship has a weight, so I used a dataframe like this:
nodeA nodeB nodeC
nodeA 0 5 1
nodeB 5 0 4
nodeC 1 4 0
But I think this is an improper way to express relationships, because the dataframe
is symmetric and contains duplicated data.
Is there a more proper way than using a dataframe to represent a graph in Python?
(Sorry for my bad English)
This seems like an acceptable way to represent a graph, and is in fact compatible with, say, networkx. For example, you can recover a networkx graph object as follows:
import networkx as nx
g = nx.from_pandas_adjacency(df)
print(g.edges)
# [('nodeA', 'nodeB'), ('nodeA', 'nodeC'), ('nodeB', 'nodeC')]
print(g.get_edge_data('nodeA', 'nodeB'))
# {'weight': 5}
If your graph is sparse, you may want to store it as an edge list instead, e.g. as discussed here.
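For instance, here is a sketch of converting the adjacency matrix into a sparse edge list with `nx.to_pandas_edgelist` (the `df` below reproduces the matrix from the question):

```python
import pandas as pd
import networkx as nx

# the symmetric adjacency matrix from the question
df = pd.DataFrame([[0, 5, 1], [5, 0, 4], [1, 4, 0]],
                  index=['nodeA', 'nodeB', 'nodeC'],
                  columns=['nodeA', 'nodeB', 'nodeC'])

g = nx.from_pandas_adjacency(df)

# each undirected edge appears only once, with its weight as a column,
# so the symmetric duplication disappears
edges = nx.to_pandas_edgelist(g)
print(edges)
```

This stores three rows (one per edge) instead of nine cells, which matters as the graph grows sparser.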
I have a dataframe similar to this one, but a lot bigger (3000x3000):
   A  B  C  D  E
W  3  1  8  3  4
X  2  2  9  1  1
Y  5  7  1  3  7
Z  6  8  5  8  9
where [A,B,C,D,E] are the column names and [W,X,Y,Z] are the row indices.
I want to compare every cell with its surrounding cells. If a cell has a greater value than a neighbouring cell, create a directed edge (using the networkX package) from that cell to its smaller-valued neighbour. For example:
examining cell (X,B), we should add the following:
G.add_edge((X,B), (W,B)) and G.add_edge((X,B), (Y,C)) and so on for every cell in the dataframe.
Currently I am doing it using two nested loops. However this takes hours to finish and a lot of resources (RAM).
Is there any more efficient way to do it?
If you want to have edges in a networkx graph, then you will not be able to avoid the nested for loop.
The comparison is actually easy to optimize. You could make four copies of your matrix and shift each copy one step in a different direction. You can then vectorize the comparison with a simple df > df_copy for each direction.
Nevertheless, when it comes to creating the edges in your graph, it is necessary for you to iterate over both axes.
My recommendation is to write the data preparation part in Cython. Also have a look at graph-tool, which is written in C++ at its core. With that many edges you will probably also run into performance issues in networkx itself.
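A sketch of the shifted-copy comparison described above, using the matrix from the question (the slicing offsets are one way to line the windows up; the direction list could be extended with the diagonals):

```python
import numpy as np
import pandas as pd
import networkx as nx

# the matrix from the question
df = pd.DataFrame([[3, 1, 8, 3, 4],
                   [2, 2, 9, 1, 1],
                   [5, 7, 1, 3, 7],
                   [6, 8, 5, 8, 9]],
                  index=list('WXYZ'), columns=list('ABCDE'))

a = df.to_numpy()
rows, cols = a.shape
G = nx.DiGraph()

# four axis-aligned directions; extend with (+-1, +-1) for diagonals
for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
    # overlapping windows: src[i, j] sits at (i+off_r, j+off_c) in the
    # full matrix, dst is the same window shifted by (dr, dc)
    off_r, off_c = max(-dr, 0), max(-dc, 0)
    src = a[off_r:rows - max(dr, 0), off_c:cols - max(dc, 0)]
    dst = a[max(dr, 0):rows + min(dr, 0), max(dc, 0):cols + min(dc, 0)]
    r, c = np.nonzero(src > dst)  # vectorised comparison per direction
    G.add_edges_from(
        ((df.index[i + off_r], df.columns[j + off_c]),
         (df.index[i + off_r + dr], df.columns[j + off_c + dc]))
        for i, j in zip(r, c)
    )
```

The comparisons are vectorised; only the final `add_edges_from` still walks over the matching cells, which is unavoidable if the result must be a networkx graph.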
I have a pandas DataFrame that holds the data for some objects, among which the position of some parts of the object (Left, Top, Right, Bottom).
For example:
ObjectID  Left  Right  Top  Bottom
1         0     0      0    0
2         20    15     5    5
3         3     2      0    0
How can I cluster the objects based on these 4 attributes?
Is there a clustering algorithm/technique that you recommend me?
Almost all clustering algorithms are multivariate and can be used here. So your question is too broad.
It may be worth looking at appropriate distance measures first.
Any concrete recommendation would be a guess, because we don't know how your data is distributed.
Depending upon the data type and final objective, you can try k-means, k-modes or k-prototypes. If your data contains a mix of categorical and continuous variables, then you can try the partitioning-around-medoids algorithm. However, as stated earlier by another user, can you give more information about the type of data and its variance?
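As a minimal sketch of the k-means suggestion, assuming scikit-learn is available and using the three sample rows from the question (with real data you would scale the columns first, e.g. with StandardScaler, and pick the number of clusters more carefully):

```python
import numpy as np
from sklearn.cluster import KMeans

# the Left/Right/Top/Bottom values from the question, one row per object
X = np.array([[0, 0, 0, 0],
              [20, 15, 5, 5],
              [3, 2, 0, 0]])

# k=2 is an arbitrary choice for this tiny example; with real data,
# pick k via e.g. the elbow method or silhouette scores
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```

Here objects 1 and 3 end up in the same cluster, since their coordinates are close, while object 2 is far from both.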
I am trying to create a networkx graph mapping the business connections in our database. In other words, I would like every id (i.e. each individual business) to be a node, and I would like there to be a line connecting the nodes that are 'connected'. A business is considered connected with another if the lead_id and connection_id are associated together, as per the data structure below.
lead_id connection_id
56340 1
56340 2
58684 3
58696 4
58947 5
Every example I find in the networkx documentation uses the following:
G=nx.random_geometric_graph(200,0.125)
pos=nx.get_node_attributes(G,'pos')
I am trying to determine how to incorporate my values into this.
Here is a way to create a graph from the data presented:
import networkx as nx

G = nx.Graph()
for lead, conn in zip(data.lead_id, data.connection_id):
    G.add_edge(lead, conn)
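Alternatively, networkx can build the graph directly from an edge-list DataFrame via `nx.from_pandas_edgelist` (the `data` below mirrors the sample rows in the question):

```python
import pandas as pd
import networkx as nx

# the sample connections from the question
data = pd.DataFrame({'lead_id': [56340, 56340, 58684, 58696, 58947],
                     'connection_id': [1, 2, 3, 4, 5]})

# each row becomes one edge between its lead_id and connection_id
G = nx.from_pandas_edgelist(data, source='lead_id', target='connection_id')
print(G.number_of_nodes(), G.number_of_edges())
```

Note that this treats lead_id and connection_id values as nodes in the same namespace; if they can collide, prefix them before building the graph.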
I am working with networks representing the interactions between characters in Spanish theatre plays. Here is a visualisation:
I passed several attributes of the nodes (characters) as a dataframe to the network, so that I can use these values (for example, the color of the nodes is set by the gender of the character). I want to calculate different values for each node with NetworkX (degree, centrality, betweenness...); then, I would like to output as a DataFrame both my attributes of the nodes and also the values calculated with NetworkX. I know I can ask for specific attributes of the nodes, like:
nx.get_node_attributes(graph,'degree')
And I could build a DataFrame using that, but I wonder if there is a more elegant solution. I have also tried:
nx.to_dict_of_dicts(graph)
But this outputs only the edges and not the information about the nodes.
So, any help, please? Thanks!
If I understand your question correctly, you want a DataFrame which has the nodes and some of the attributes of each node.
G = nx.Graph()
G.add_node(1, x=11, y=111)  # attributes are keyword arguments in networkx 2.x
G.add_node(2, x=22, y=222)
You can use list comprehension as follow to get specific node attributes:
pd.DataFrame([[n, d['x'], d['y']] for n, d in G.nodes(data=True)],
             columns=['node_name', 'x', 'y']).set_index('node_name')
# x y
#node_name
#1 11 111
#2 22 222
or if there are many attributes and you need all of them, this could be a better solution:
pd.DataFrame([d for _, d in G.nodes(data=True)],
             index=[n for n, _ in G.nodes(data=True)])
I am a newbie in Python, and after browsing several answers to various questions concerning loops in python/pandas, I remain confused about how to solve my problem concerning water-management data. I am trying to categorise and aggregate data based on its position in a sequence of connected nodes. The "network" is formed by each node containing the ID of the node that is downstream of it.
The original data contains roughly 53 000 items, which I converted to a pandas dataframe and looks something like this:
subwatershedsID = pd.DataFrame({ 'ID' : ['649208-127140','649252-127305','650556-126105','687315-128898'],'ID_DOWN' : ['582500-113890','649208-127140','649252-127305','574050-114780'], 'OUTLET_ID' : ['582500-113890','582500-113890','582500-113890','574050-114780'], 'CATCH_ID' : [217,217,217,213] })
My naive approach to deal with the data closest to the coast illustrates what I am trying to achieve.
sbwtrshdNextToStretch = subwatershedsID.loc[subwatershedsID['ID_DOWN'] == subwatershedsID['OUTLET_ID']]
sbwtrshdNextToStretchID = sbwtrshdNextToStretch[['ID']]
sbwtrshdStepFurther = pd.merge(sbwtrshdNextToStretchID, subwatershedsID, how='inner', left_on='ID', right_on='ID_DOWN')
sbwtrshdStepFurther.rename(columns={'ID_y': 'ID'}, inplace=True)
sbwtrshdStepFurtherID = sbwtrshdStepFurther[['ID']]
sbwtrshdTwoStepsFurther = pd.merge(sbwtrshdStepFurtherID, subwatershedsID, how='inner', left_on='ID', right_on='ID_DOWN')
sbwtrshdTwoStepsFurther.rename(columns={'ID_y': 'ID'}, inplace=True)
sbwtrshdTwoStepsFurtherID = sbwtrshdTwoStepsFurther[['ID']]
subwatershedsAll = [sbwtrshdNextToStretchID, sbwtrshdStepFurtherID, sbwtrshdTwoStepsFurtherID]
subwatershedWithDistances = pd.concat(subwatershedsAll, keys=['d0', 'd1', 'd2'])
So this gives each node an identifier for how many nodes away it is from the first one. It feels like there should be a simpler way to achieve this, and obviously something that works better for the full data, which can have a large number of consecutive connections. However, my thoughts keep returning to writing a loop within a loop, yet all the advice seems to recommend avoiding loops, which also discourages me from learning how to write them correctly. Furthermore, the comments about poor loop performance leave me with further doubts, since I am not sure how fast a solution for 53 000 rows would be. So what would be a good, Pythonic solution?
If I understand correctly you have two stages:
Categorise each node based on its position in the network
Perform calculations on the data to work out things like volumes of water, number of nodes a certain distance from the outlet, etc.
If so...
1) Use NetworkX to perform the calculations on relative position in the network
NetworkX is a great network analysis library that comes with ready-made methods to achieve this kind of thing.
Here's an example using dummy data:
import networkx as nx

G = nx.Graph()
G.add_nodes_from([1,2,3,4])
G.add_edges_from([(1,2),(2,3),(3,4)])
# In this example, the shortest path is all the way down the stream
nx.shortest_path(G,1,4)
> [1,2,3,4]
len(nx.shortest_path(G,1,4))
> 4
# I've shortened the path by adding a new 'edge' (connection) between 1 and 4
G.add_edges_from([(1,2),(2,3),(3,4),(1,4)])
# Result is a much shorter path of only two nodes - the source and target
nx.shortest_path(G,1,4)
> [1,4]
len(nx.shortest_path(G,1,4))
> 2
2) Annotate the dataframe for later calculations
Once you have this data in a network format, you can iterate through the data and add that as metadata to the DataFrame.
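Putting the two stages together, here is a sketch using the sample data from the question; it assumes every ID eventually drains into its OUTLET_ID (otherwise `shortest_path_length` raises for the unreachable pairs):

```python
import pandas as pd
import networkx as nx

# the sample data from the question
subwatershedsID = pd.DataFrame({
    'ID': ['649208-127140', '649252-127305', '650556-126105', '687315-128898'],
    'ID_DOWN': ['582500-113890', '649208-127140', '649252-127305', '574050-114780'],
    'OUTLET_ID': ['582500-113890', '582500-113890', '582500-113890', '574050-114780'],
    'CATCH_ID': [217, 217, 217, 213]})

# build the network from the downstream links
G = nx.from_pandas_edgelist(subwatershedsID, source='ID', target='ID_DOWN')

# hops from each node to its outlet, minus 1 so that a node draining
# straight into the outlet gets distance 0 (the 'd0' group in the question)
subwatershedsID['distance'] = [
    nx.shortest_path_length(G, row.ID, row.OUTLET_ID) - 1
    for row in subwatershedsID.itertuples()
]
print(subwatershedsID[['ID', 'distance']])
```

This replaces the chain of pairwise merges with one pass over the rows, and the distance column can then drive any groupby-style aggregation.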