How to make networkx edges from pandas dataframe rows - python

For context:
I am making a visual graph for a protein-protein interaction network. A node here corresponds to a protein and an edge would indicate interaction between two nodes.
Here is my code:
First I import all the modules and files that I need:
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
interactome_edges = pd.read_csv("*a_directory*", delimiter = "\t", header = None)
interactome_nodes = pd.read_csv("*a_directory*", delimiter = "\t", header = None)
# A few adjustments for the dataframes
interactome_nodes = interactome_nodes.drop(columns = [0])
interactome_edges.columns = ["node1","node2"]
Dataframe for nodes looks like this:
1
0 MET3
1 IMD3
2 OLE1
3 MUP1
4 PIS1
...
Dataframe for edges looks like this:
node1 node2
0 MET3 MET3
1 IMD3 IMD4
2 OLE1 OLE1
3 MUP1 MUP1
4 PIS1 PIS1
...
Basically the edge goes from node1 to node2
Now I iterate through each row from the node dataframe and edge dataframe and use it as networkx nodes and edges.
interactome = nx.Graph()
# Adding Nodes to Graph
for index, row in interactome_nodes.iterrows():
    interactome.add_nodes_from(row)
# Adding Edges to Graph
for index, row in interactome_edges.iterrows():
    interactome.add_edges_from(row["node1", "node2"]) #### Here is the problem
My problem is at the adding Edges part.
I am currently getting the following error:
KeyError: ('node1', 'node2')
I have also tried:
for index, row in interactome_edges.iterrows():
    interactome.add_edges_from((row["node1"],row["node2"]))
and:
for index, row in interactome_edges.iterrows():
    interactome.add_edges_from(row["node1"],row["node2"])
and also simply:
for index, row in interactome_edges.iterrows():
    interactome.add_edges_from(row)
All of which give me some form of error.
How can I use my node to node dataframe as edges for a networkx graph?

In [9]: import networkx as nx
In [10]: import pandas as pd
In [11]: df = pd.read_csv("a.csv")
In [12]: df
Out[12]:
node1 node2
0 MET3 MET3
1 IMD3 IMD4
2 OLE1 OLE1
3 MUP1 MUP1
4 PIS1 PIS1
In [13]: G=nx.from_pandas_edgelist(df, "node1", "node2")
In [14]: [e for e in G.edges]
Out[14]:
[('MET3', 'MET3'),
('IMD3', 'IMD4'),
('OLE1', 'OLE1'),
('MUP1', 'MUP1'),
('PIS1', 'PIS1')]
NetworkX has functions for reading from pandas dataframes. Here I have used the edge dataframe provided in the question and read it with the from_pandas_edgelist function.
To plot the graph and save it to a file:
import matplotlib.pyplot as plt
nx.draw_planar(G, with_labels = True)
plt.savefig("filename2.png")
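As for why the loops in the question fail: row["node1", "node2"] looks up the tuple ('node1', 'node2') as a single label, hence the KeyError, and add_edges_from expects an iterable of edge tuples rather than one pair of node names. A minimal sketch of two ways to keep an explicit loop, assuming a small stand-in dataframe (edges_df is a hypothetical name, not from the question):

```python
import networkx as nx
import pandas as pd

# Hypothetical stand-in for the question's interactome_edges dataframe.
edges_df = pd.DataFrame({
    "node1": ["MET3", "IMD3", "OLE1"],
    "node2": ["MET3", "IMD4", "OLE1"],
})

# Option 1: add_edge takes the two endpoints of a single edge.
G_loop = nx.Graph()
for _, row in edges_df.iterrows():
    G_loop.add_edge(row["node1"], row["node2"])

# Option 2: add_edges_from takes an iterable of (u, v) tuples,
# so the whole dataframe can be passed in one call.
G_bulk = nx.Graph()
G_bulk.add_edges_from(
    edges_df[["node1", "node2"]].itertuples(index=False, name=None)
)
```

Both graphs end up with the same edges as the from_pandas_edgelist version; the one-call form is usually the simpler choice.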

Related

DataFrame index and column as nodes for Networkx

I'm looking to convert a DataFrame to a NetworkX graph: I would like to use the Dataframe as a map where the indexes are the "sources" and the columns are the "targets". The values should be the weights.
df = pd.DataFrame(np.random.randint(0,3,size=(4, 4)), columns=list('ABCD'), index = list('ABCD'))
df
G = nx.from_pandas_edgelist(
df, source=df.index, target=df.column, create_using=nx.DiGraph
)
Here is an example DataFrame; each index should connect to a column if the value is non-zero.
Would you know how to?
Use nx.from_pandas_adjacency():
import pandas as pd
import numpy as np
import networkx as nx
df = pd.DataFrame(np.random.randint(0,3,size=(4, 4)), columns=list('ABCD'), index = list('ABCD'))
G = nx.from_pandas_adjacency(df, create_using=nx.DiGraph)
As the comment from @Huug points out, remember to pass create_using=nx.DiGraph to the function so that the result is a directed graph.
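To make the mapping concrete, here is a small sketch with a fixed matrix instead of random values (the numbers are made up for illustration). Row labels act as sources, column labels as targets, non-zero cells become edges with the cell value stored under the "weight" attribute, and zero cells produce no edge:

```python
import networkx as nx
import pandas as pd

# Illustrative adjacency matrix: rows are sources, columns are targets.
df = pd.DataFrame(
    [[0, 2, 0],
     [1, 0, 0],
     [0, 1, 0]],
    index=list("ABC"),
    columns=list("ABC"),
)

G = nx.from_pandas_adjacency(df, create_using=nx.DiGraph)

# Collect (source, target, weight) triples in a stable order.
edges = sorted(G.edges(data="weight"))
print(edges)
```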

how to store "networkx info" output in a data frame

I want to store output of following NetworkX output into a Pandas data frame:
for i in (node_id):
    G.remove_nodes_from([i])
    (nx.info(G))
Current output looks like follows:
Name:
Type: Graph
Number of nodes: 262
Number of edges: 455
Average degree: 3.4733
Name:
Type: Graph
Number of nodes: 261
Number of edges: 425
Average degree: 3.2567
Could you please tell me a way to store this output in a data frame or dictionary?
nx.info returns a string, which you can feed to pandas.read_csv:
import networkx as nx
import io
import pandas as pd
# dummy graph
G = nx.star_graph(5)
df = pd.read_csv(io.StringIO(nx.info(G)), sep=r':\s*', engine='python', names=['attribute', 'value'])
print(df)
Output:
attribute value
0 Name NaN
1 Type Graph
2 Number of nodes 6
3 Number of edges 5
4 Average degree 1.6667
Note that nx.info is deprecated and was removed in NetworkX 3.0.
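An alternative that does not depend on nx.info is to read the counts straight from the graph after each removal and collect them as rows. This is only a sketch: node_ids and the column names are placeholders I chose, and the average degree is computed as 2 * edges / nodes for an undirected graph.

```python
import networkx as nx
import pandas as pd

# Dummy graph: a star with centre 0 and leaves 1..5.
G = nx.star_graph(5)
node_ids = [5, 4]  # hypothetical list of nodes to remove, one at a time

records = []
for node in node_ids:
    G.remove_node(node)
    n = G.number_of_nodes()
    m = G.number_of_edges()
    records.append({
        "removed": node,
        "nodes": n,
        "edges": m,
        # Average degree of an undirected graph; guard against an empty graph.
        "avg_degree": 2 * m / n if n else 0.0,
    })

stats = pd.DataFrame(records)
print(stats)
```

Each row of stats then describes the graph right after one removal, which is easier to filter or plot than parsed text.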

Why does networkx reduce number of nodes after adding edges

I need to start this by saying that my code runs without any error messages, but I don't understand some of the results.
I create a graph in networkx from a pandas data frame, that has 398595 integer IDs.
# Create Graph
G = nx.Graph()
G.name = "Graph from Pandas"
# Add Nodes to Graph
G.add_nodes_from(test_df['ID'].tolist())
print(nx.info(G))
The output from nx.info(G) is as follows, which is also correct this is what I expected:
Type: Graph
Number of nodes: 398595
Number of edges: 0
Average degree: 0.0000
Then I load a second pandas data frame and it contains 5556353 entries and has three columns:
ID1 ID2 weight
3 198 0.601002
3 183 0.618057
Each ID in ID1 or ID2 exists also into the first pandas dataframe, so I load the edges as follows:
# Add data to Graph
G = nx.from_pandas_edgelist(df,source='ID1',target='ID2', edge_attr='weight')
print(nx.info(G))
However here is what I don't understand, the output from nx.info(G) now returns:
Type: Graph
Number of nodes: 29348
Number of edges: 4371353
Average degree: 297.8978
Now my questions are (1) why are there fewer nodes in this graph than before and (2) why are there considerably fewer edges in this Graph than available from the data frame?
First, note that nx.from_pandas_edgelist returns a new graph, so G no longer contains the nodes you added earlier; it contains only the IDs that appear in ID1 or ID2. There are probably fewer unique IDs across ID1 and ID2 of df than there are in the ID column of test_df. To check, compare len(pd.unique(df[['ID1','ID2']].values.ravel())) with the number of nodes shown (it should equal 29348).
One reason for the lower edge count is that the dataframe may list the same pair more than once, for example once in each direction. By default, nx.from_pandas_edgelist builds an nx.Graph, so edges are treated as undirected and repeated edges are collapsed into one. If you want directed edges, multiple edges, or both, pass nx.DiGraph, nx.MultiGraph, or nx.MultiDiGraph respectively as the create_using parameter.
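A toy sketch of both effects, using made-up stand-ins for test_df and df. If the isolated IDs should stay in the graph, one option is to add them back after building the graph from the edge list:

```python
import networkx as nx
import pandas as pd

# Made-up stand-ins for the question's dataframes.
test_df = pd.DataFrame({"ID": [1, 2, 3, 4, 5]})
df = pd.DataFrame({
    "ID1": [1, 2, 1],
    "ID2": [2, 1, 3],
    "weight": [0.5, 0.7, 0.9],
})

# Undirected graph: (1, 2) and (2, 1) are the same edge.
G = nx.from_pandas_edgelist(df, source="ID1", target="ID2", edge_attr="weight")
nodes_from_edges = G.number_of_nodes()
edges_undirected = G.number_of_edges()

# Add the full ID list afterwards so IDs without edges are kept as isolated nodes.
G.add_nodes_from(test_df["ID"].tolist())
nodes_total = G.number_of_nodes()

# Directed graph: (1, 2) and (2, 1) are kept as separate edges.
D = nx.from_pandas_edgelist(
    df, source="ID1", target="ID2", edge_attr="weight", create_using=nx.DiGraph
)
edges_directed = D.number_of_edges()

print(nodes_from_edges, edges_undirected, nodes_total, edges_directed)
```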

Generate laplacian matrix from non-square dataset

I have a dataset as the following where the first and second columns indicate nodes connection from to:
fromNode toNode
0 1
0 2
0 31
0 73
1 3
1 56
2 10
...
I want to generate laplacian matrix from this dataset. I use the following code to do so but it complains as the dataset itself is not square matrix. Is there a function that accept this type of dataset and generates the matrix?
from numpy import genfromtxt
from scipy.sparse import csgraph
import csv
G = genfromtxt('./data.csv', delimiter='\t').astype(int)
dataset = csgraph.laplacian(G, normed=False)
Rather than looking for a function that will accept your data, process your data into the correct format.
The fake data below lets f simulate a file object. On Python 3, use io.StringIO rather than io.BytesIO.
import io
import numpy as np
data = '''0 1
0 2
0 31
0 73
1 3
1 56
2 10'''
f = io.StringIO(data)  # text buffer standing in for an open file (needs `import io`)
Read each line of the data and process it into a list of edges of the form (node1, node2).
edges = []
for line in f:
    line = line.strip()
    (node1, node2) = map(int, line.split())
    edges.append((node1,node2))
Find the highest node number and create a square numpy ndarray with one row and column per node. You need to be aware of your node numbering: is it zero based?
N = max(x for edge in edges for x in edge)
G = np.zeros((N+1,N+1), dtype = np.int64)
Iterate over the edges and set the corresponding entry of the adjacency matrix to the edge weight (1 here).
for row, column in edges:
    G[row,column] = 1
Here is an alternative solution that uses numpy integer array indexing.
f.seek(0)  # rewind f in case it was already consumed by a loop over its lines
z = np.genfromtxt(f, dtype = np.int64)
n = z.max() + 1
g = np.zeros((n,n), dtype = np.int64)
rows, columns = z.T
g[rows, columns] = 1
Of course both of those assume all edge weights are equal.
See Graph Representations in the scipy docs. I couldn't test this graph with csgraph to see if it is valid, because I'm getting an import error for csgraph; I probably need to update.
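For completeness, a sketch of the whole pipeline ending in csgraph.laplacian. The edge list here is a small made-up example, and the adjacency matrix is symmetrized on the assumption that the edges are undirected; drop that line if the direction from fromNode to toNode matters.

```python
import numpy as np
from scipy.sparse import csgraph

# Small made-up edge list in the same (fromNode, toNode) layout.
edges = np.array([[0, 1], [0, 2], [1, 3], [2, 3]])

# Square adjacency matrix sized by the highest node number (zero based).
n = edges.max() + 1
adj = np.zeros((n, n), dtype=np.int64)
rows, cols = edges.T
adj[rows, cols] = 1
adj[cols, rows] = 1  # assumption: undirected graph, so mirror each edge

# Unnormalized Laplacian: degree matrix minus adjacency matrix.
lap = csgraph.laplacian(adj, normed=False)
print(lap)
```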

Creating key:attribute pairs in networkx for Python

I am working on creating a graph method for analyzing images using pixels as nodes in Python. I am using networkx for graph support (documentation here: https://networkx.github.io/documentation/latest/index.html ). Take this as an example:
new=np.arange(256)
g=nx.Graph()
for x in new:
    g.add_node(x)
h=g.order()
print(h)
As expected, 256 nodes will be created.
Now, I would like to create node:attribute pairs based on another array, namely:
newarray=np.arange(256)
for x in new:
    g.add_node(x)
    nx.set_node_attributes(g, 'value', newarray[x])
With the addition of this line, I was hoping that the first value of newarray would be assigned to the first node of g, and so on. Instead, every node of g is assigned the last value of newarray, namely 255. How can I add attribute pairs for each node, element by element?
You need to pass set_node_attributes a dictionary keyed by node, rather than a single value. Note that since NetworkX 2.0 the argument order is (G, values, name), and g.node has been replaced by g.nodes. See if this code does what you need:
import numpy as np
import networkx as nx
array1 = np.arange(256)
array2 = np.arange(256) * 10
g = nx.Graph()
valdict = {}
for x in array1:
    g.add_node(x)
    valdict[x] = array2[x]  # map each node to its own value
nx.set_node_attributes(g, valdict, 'value')
for i in array1:
    print(i, g.nodes[i]['value'])
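Two shorter variants, sketched under the assumption of NetworkX 2.x or later: the attribute can be attached when the node is created, or the dictionary can be built in one step from the array. The names values, g and h are only for illustration.

```python
import numpy as np
import networkx as nx

values = np.arange(256) * 10

# Variant 1: attach the attribute as a keyword argument when adding the node.
g = nx.Graph()
for x in range(256):
    g.add_node(x, value=int(values[x]))

# Variant 2: add all nodes, then set the attribute from a {node: value} dict.
h = nx.Graph()
h.add_nodes_from(range(256))
nx.set_node_attributes(h, dict(enumerate(values.tolist())), "value")

print(g.nodes[3]["value"], h.nodes[255]["value"])
```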
