Generate Network plot from large pandas dataframe - python

Suppose I have this dataframe df that contains 3794 rows x 2 columns, where column a_number represents nodes with directed edges to nodes in b_number:
a_number b_number
0 0123456789343 0123456789991
1 0123456789343 0123456789633
2 0123456789343 0123456789633
3 0123456789343 0123456789628
4 0123456789343 0123456789633
... ... ...
3789 0123456789697 0123456789916
3790 0123456789697 0123456789886
3791 0123456789697 0123456789572
3792 0123456789697 0123456789884
3793 0123456789697 0123456789125
3794 rows × 2 columns
Additional information:
len(df['a_number'].unique())
>>> 18
len(df['b_number'].unique())
>>>1145
I am trying to generate an image representation of the graph. Here's code to apply networkx:
import networkx as nx
G = nx.DiGraph()
for i, (x, y) in df.iterrows():
    G.add_node(x)
    G.add_node(y)
    G.add_edge(x,y)
nx.draw(G, with_labels = True, font_size=14 , node_size=2000)
I get this output:
I am having trouble visualizing graphs created with python-networkx. I want to be able to reduce clutter and regulate the distance between the nodes. Please advise. What can I change in the code? Thank you.

First, to reduce the clutter, I would start by decreasing the node size, to maybe 200 or 400.
Try reducing the font_size parameter in the draw function. This parameter regulates the size of the node labels. Since you have long node names, it will help reduce the clutter.
If having the labels on the graph is not necessary, remove them to make it cleaner by passing with_labels=False to the draw function.
Then, to regulate the distance between nodes, you can use the spring layout for the node positions.
pos = nx.spring_layout(G, k=0.8)
nx.draw(G, pos , with_labels = True, font_size=7, node_size=400)
The k parameter in the spring layout sets the optimal distance between nodes, so larger values push nodes farther apart. You can try different values to see what suits you best.
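As a minimal sketch of the advice above, the snippet below builds the directed graph in one call and draws it with smaller nodes, no labels, and a spring layout. The toy dataframe, the helper names build_graph and draw_decluttered, the figure size, and the seed value are assumptions for illustration; only a_number, b_number, k, node_size and with_labels come from the question and answer.

```python
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd


def build_graph(df):
    """Build a directed graph with one edge per distinct (a_number, b_number) pair."""
    return nx.from_pandas_edgelist(
        df, source="a_number", target="b_number", create_using=nx.DiGraph
    )


def draw_decluttered(G, k=0.8, node_size=200, seed=42):
    """Draw G without labels on a spring layout and return the figure."""
    # Larger k pushes nodes farther apart; seed makes the layout repeatable.
    pos = nx.spring_layout(G, k=k, seed=seed)
    fig, ax = plt.subplots(figsize=(10, 10))
    nx.draw(G, pos, ax=ax, with_labels=False, node_size=node_size)
    return fig


# Hypothetical stand-in for the real 3794-row dataframe. IDs are kept as
# strings so that leading zeros survive.
toy_df = pd.DataFrame(
    {
        "a_number": ["0101", "0101", "0101", "0202", "0202"],
        "b_number": ["0991", "0633", "0633", "0633", "0125"],
    }
)
G = build_graph(toy_df)
```

Calling draw_decluttered(G) and then plt.show() displays the figure. Repeated rows in the dataframe collapse into a single edge because a DiGraph stores at most one edge per ordered pair of nodes.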

Related

How to display graph in Pyvis more clearly?

I want to visualize a graph in Pyvis whose nodes have labels. I am able to render it in Pyvis, but the result is not clear and the edges are tangled. Is there any way to display the graph more clearly?
The image below shows the graph.
For example, node 15 in the graph is displayed well. I want the other nodes to be displayed in a similarly clear way, so that the connections are easier to see.
Update:
This is the code I use for drawing the graph using Pyvis:
import networkx as nx
import seaborn as sns
from pyvis.network import Network

def showGraph(FileName, labelList):
    Txtfile = open("./results.txt")
    G = nx.read_weighted_edgelist(Txtfile)
    Txtfile.close()
    # one pastel color per distinct label value
    palette = (sns.color_palette("Pastel1", n_colors=len(set(labelList.values()))))
    palette = palette.as_hex()
    colorDict = {}
    counter = 0
    for i in palette:
        colorDict[counter] = i
        counter += 1
    N = Network(height='100%', width='100%', directed=False, notebook=False)
    for n in G.nodes:
        N.add_node(n, color=(colorDict[labelList[n]]), size=5)
    for e in G.edges.data():
        N.add_edge(e[0], e[1], title=str(e[2]), value=e[2]['weight'])
    N.show('result.html')
results.txt is my edge list file and labelList holds the label of each node. Labels are numerical. For example, the label of node 48 is 5, but it can be anything. I use labels to give different colors to nodes.
The NetworkX circular layouts tend to make individual nodes and the connections between them easier to see, so you could try that as long as you don't want nodes to move (without dragging) after you've drawn them.
Before creating your pyvis network, run the following on your NetworkX graph to create a dictionary that will be keyed by node and have (x, y) positions as values. You might need to mess around with the scale parameter a bit to see what works best for you.
pos = nx.circular_layout(G, scale = 1000)
You can then add x and y values from pos to your pyvis network when you add each node. Adding physics = False keeps the nodes in one place unless you click and drag them around.
for n in G.nodes:
    N.add_node(n,
               color=(colorDict[labelList[n]]),
               size=5,
               x = pos[n][0],
               y = pos[n][1],
               physics = False)
I'm not sure how the edge weights will play into things, so you should probably also add physics = False to the add_edge parameters to ensure that nothing will move.
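The fixed-position idea can be sketched without pyvis itself by precomputing the keyword arguments for each node. The helper name fixed_node_attrs and the 10-node ring are assumptions for illustration; the x, y and physics keys are the ones used in the add_node call above.

```python
import networkx as nx


def fixed_node_attrs(G, scale=1000):
    """Map each node to keyword arguments that pin it on a circle."""
    pos = nx.circular_layout(G, scale=scale)
    return {
        # Convert NumPy scalars to plain floats so they serialize cleanly.
        n: {"x": float(xy[0]), "y": float(xy[1]), "physics": False}
        for n, xy in pos.items()
    }


# Hypothetical 10-node ring standing in for the graph read from results.txt.
G = nx.cycle_graph(10)
attrs = fixed_node_attrs(G)
```

With the Network object N from the question, each entry can then be unpacked when adding the node, for example N.add_node(n, color=colorDict[labelList[n]], size=5, **attrs[n]).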
Since I didn't have your original data, I just generated a random graph with 10 nodes and this was the result in pyvis.

How is it possible to draw unconnected nodes with networkx?

When using networkx, I only know that there are several ways of plotting graphs with edges and nodes.
Is it possible only to plot a lot of nodes, without connections between them? The points all have x- and y-coordinates. The points are saved in a pandas dataframe with only 3 columns: ID, X, Y
g = nx.from_pandas_dataframe(df1, source='x', target='y')
I tried something like this, but I don't want edges, only points.
This is a part of the dataframe:
id x y
0 550 1005.600 1539.400
1 551 1006.600 1549.400
2 705 1029.997 2140.001
3 706 1030.997 2141.001
4 478 180.000 1354.370
5 479 190.000 1354.370
.. ... ... ...
500 237 1135.000 2615.000
501 238 1145.000 2615.000
You can draw nodes and edges separately. Use the following to draw only the nodes:
nodes = nx.draw_networkx_nodes(G, pos)
Note that draw_networkx_nodes requires a pos dictionary mapping each node to its position, so you need to build pos from the x and y values in your dataframe. (At that point I would rather not use networkx...)
See the docs...
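A minimal sketch of building an edgeless graph and its pos dictionary from the dataframe might look like this. The four rows are copied from the question; the helper name draw_points and the node_size value are assumptions.

```python
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd

# A few rows copied from the question's dataframe.
df1 = pd.DataFrame(
    {
        "id": [550, 551, 705, 706],
        "x": [1005.600, 1006.600, 1029.997, 1030.997],
        "y": [1539.400, 1549.400, 2140.001, 2141.001],
    }
)

# Add the IDs as nodes only; no edges are ever created.
G = nx.Graph()
G.add_nodes_from(df1["id"])

# Map each node ID to its (x, y) coordinates.
pos = {node: (x, y) for node, x, y in zip(df1["id"], df1["x"], df1["y"])}


def draw_points(G, pos, node_size=20):
    """Draw only the nodes of G at the given positions and return the figure."""
    fig, ax = plt.subplots()
    nx.draw_networkx_nodes(G, pos, node_size=node_size, ax=ax)
    return fig
```

Calling draw_points(G, pos) and then plt.show() displays the points. If no graph operations are needed afterwards, a plain matplotlib scatter plot of the x and y columns gives the same picture.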

Networkx Remove NaN Node in graph and its edges but keep the connected nodes

I want to visualize different sequences of data with Networkx. Not all sequences have a destination point, but I want to show these "single" sequences as a single circle in the plot. (If they also appear in a different sequence, e.g. "A" also occurs in "AB", then the "A" circle of the "AB" sequence should get bigger and no additional "A" circle should appear.)
My raw input data looks like this:
Source  Target
A       B
A       B
L
A
G       C
M       M
My desired output with this example sequences looks like this (ignoring colors):
My code looks like this so far. (Right now the size of a node depends on the number of edges it has; I also want to change it to depend on how often the node appears in the data.)
import pandas as pd
import networkx as nx
from matplotlib import pyplot as plt
dataset = pd.read_csv(r'confidential_data.csv')
df = dataset.copy()
df = df.groupby(df.columns.tolist(),as_index=False, dropna = False).size()
df['size'] = df['size'].div(100).round(2) #dividing by 100 because the thickness of the lines would get too big
G=nx.from_pandas_edgelist(df, 'source', 'target', edge_attr='size', create_using=nx.DiGraph() )
d = dict(G.degree)
widths = nx.get_edge_attributes(G, 'size')
nodelist = G.nodes()
plt.figure(figsize=(12,8))
pos = nx.shell_layout(G)
nx.draw_networkx_nodes(G,pos,
                       nodelist=nodelist,
                       node_size= [v * 400 for v in d.values()],
                       node_color='orange',
                       alpha=0.7)
nx.draw_networkx_edges(G,pos = pos,
                       edgelist = widths.keys(),
                       width=list(widths.values()),
                       edge_color='blue',
                       alpha=0.6,)
nx.draw_networkx_labels(G, pos=pos,
                        labels=dict(zip(nodelist,nodelist)),
                        font_color='black')
edge_labels = nx.get_edge_attributes(G, "size")
nx.draw_networkx_edge_labels(G, pos=pos, edge_labels = edge_labels, label_pos=0.2,font_size=15)
plt.box(False)
plt.show()
With this, my plot looks something like this with the example data:
Here you can see that the single occurrences "A" and "L" in the table point to NaN. I don't want the connection to NaN. Is there a way either to remove the NaN node with all its edges (without affecting the other nodes) or to plot data without a target point at all?
If you know a solution or maybe have a different kind of visualization in mind for this kind of data and problem, please let me know.
Thank you in advance!

Why does networkx reduce number of nodes after adding edges

I need to start this by saying that my code runs without any error messages, but I don't understand some of the results.
I create a graph in networkx from a pandas data frame, that has 398595 integer IDs.
# Create Graph
G = nx.Graph()
G.name = "Graph from Pandas"
# Add Nodes to Graph
G.add_nodes_from(test_df['ID'].tolist())
print(nx.info(G))
The output from nx.info(G) is as follows, which is correct and what I expected:
Type: Graph
Number of nodes: 398595
Number of edges: 0
Average degree: 0.0000
Then I load a second pandas data frame and it contains 5556353 entries and has three columns:
ID1 ID2 weight
3 198 0.601002
3 183 0.618057
Each ID in ID1 or ID2 also exists in the first pandas dataframe, so I load the edges as follows:
# Add data to Graph
G = nx.from_pandas_edgelist(df,source='ID1',target='ID2', edge_attr='weight')
print(nx.info(G))
However here is what I don't understand, the output from nx.info(G) now returns:
Type: Graph
Number of nodes: 29348
Number of edges: 4371353
Average degree: 297.8978
Now my questions are (1) why are there fewer nodes in this graph than before and (2) why are there considerably fewer edges in this Graph than available from the data frame?
There are probably fewer unique IDs across ID1 and ID2 of df than there are in the ID column of test_df. Note also that nx.from_pandas_edgelist returns a new graph, so reassigning G discards the nodes you added earlier; only IDs that appear in at least one edge remain. The first thing I would check is whether the number of unique IDs across ID1 and ID2 in df equals the number of nodes reported: len(pd.unique(df[['ID1','ID2']].values.ravel())) should equal 29348.
One reason there are fewer edges is that the dataframe may contain directed or repeated edges. The default value for the create_using parameter of nx.from_pandas_edgelist is nx.Graph, so edges are treated as undirected and multiple edges between the same pair of nodes collapse into one. If you want directed edges, multiple edges, or both, try passing nx.DiGraph, nx.MultiGraph, or nx.MultiDiGraph respectively to the create_using parameter.
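Both effects can be reproduced on a toy example. The two small dataframes below are made up for illustration; the column names ID, ID1, ID2 and weight follow the question.

```python
import networkx as nx
import pandas as pd

# Hypothetical stand-ins for test_df (all IDs) and df (the edge list).
nodes_df = pd.DataFrame({"ID": [1, 2, 3, 4, 5]})
edges_df = pd.DataFrame(
    {"ID1": [1, 2, 1], "ID2": [2, 1, 3], "weight": [0.5, 0.7, 0.9]}
)

# Only IDs that occur in at least one edge become nodes.
ids_in_edges = pd.unique(edges_df[["ID1", "ID2"]].values.ravel())

# An undirected Graph merges (1, 2) and (2, 1) into a single edge.
G = nx.from_pandas_edgelist(edges_df, source="ID1", target="ID2", edge_attr="weight")
nodes_before = G.number_of_nodes()

# Re-adding the full ID list restores the isolated nodes 4 and 5.
G.add_nodes_from(nodes_df["ID"])

# A MultiDiGraph keeps every row as its own directed edge.
M = nx.from_pandas_edgelist(
    edges_df, source="ID1", target="ID2", edge_attr="weight",
    create_using=nx.MultiDiGraph,
)

print(len(ids_in_edges), nodes_before, G.number_of_nodes())
print(G.number_of_edges(), M.number_of_edges())
```

Adding the nodes after building the graph from the edge list, rather than before, is what keeps IDs without any edges in the final graph.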

Python: Network Spring Layout with different color nodes

I create a spring layout network of the shortest paths from a given node, in this case firm1. I want a different color for each degree of separation. For instance, for the firms directly connected to firm1, say firm2 and firm3, I would like to change the node color (same color for both). Then, for the firms connected to firm2 and firm3, say firm4 and firm5, I want yet another node color. But I don't know how to change the node colors for each degree of separation starting from firm1. Here's my code:
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
graph = nx.Graph()
with open('C:\\file.txt') as f: #Here, I load a text file with two columns indicating the connections between each firm
    for line in f:
        tic_1, tic_2 = line.split()
        graph.add_edge(tic_1, tic_2)
paths_from_1 = nx.shortest_path(graph, "firm1") #I get the shortest path starting from firm1
x = pd.DataFrame(paths_from_1.values()) #I convert the dictionary of the shortest path into a dataframe
tic_0=x[0].tolist() #there are 7 columns in my dataframe x and I convert each columns into a list. tic_0 is a list of `firm1` string
tic_1=x[1].tolist() #tic_1 is list of all the firms directly connected to firm1
tic_2=x[2].tolist() #tic_2 are the firms indirectly connected to firm1 via the firms in tic_1
tic_3=x[3].tolist() #and so on...
tic_4=x[4].tolist()
tic_5=x[5].tolist()
tic_6=x[6].tolist()
l = len(tic_0)
graph = nx.Graph()
for i in range(len(tic_0)):
    graph.add_edge(tic_0[i], tic_1[i])
    graph.add_edge(tic_1[i], tic_2[i])
    graph.add_edge(tic_2[i], tic_3[i])
    graph.add_edge(tic_3[i], tic_4[i])
    graph.add_edge(tic_4[i], tic_5[i])
    graph.add_edge(tic_5[i], tic_6[i])
pos = nx.spring_layout(graph_short, iterations=200, k=)
nx.draw(graph_short, pos, font_size='6',)
plt.savefig("network.png")
plt.show()
How can I have different node colors for each degree of separation? In other words, all the firms in tic_1 should have a blue node, all the firms in tic_2 should have a yellow node, etc.
The generic way to do this is to run the shortest path length algorithm from a source node to assign the colors. Here is an example:
import matplotlib.pyplot as plt
import networkx as nx
G = nx.balanced_tree(2,5)
length = nx.shortest_path_length(G, source=0)
nodelist,hops = zip(*length.items())
# Requires pygraphviz; older NetworkX releases exposed this as nx.graphviz_layout.
positions = nx.nx_agraph.graphviz_layout(G, prog='twopi', root=0)
nx.draw(G, positions, nodelist = nodelist, node_color=hops, cmap=plt.cm.Blues)
plt.axis('equal')
plt.show()
You could use
positions = nx.spring_layout(G)
instead. I used the graphviz twopi layout since it does a better job of drawing the balanced tree I used.
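Applied to the firm example from the question, the same approach could be sketched as follows, using only the spring layout so that graphviz is not needed. The five edges, the helper name draw_by_hops, and the seed are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import networkx as nx

# Hypothetical firm network matching the description in the question.
G = nx.Graph()
G.add_edges_from(
    [
        ("firm1", "firm2"),
        ("firm1", "firm3"),
        ("firm2", "firm4"),
        ("firm3", "firm5"),
    ]
)

# Number of hops from firm1 to every node reachable from it.
hops = nx.shortest_path_length(G, source="firm1")
nodelist = list(hops)
colors = [hops[n] for n in nodelist]


def draw_by_hops(G, nodelist, colors, seed=1):
    """Draw G with node colors taken from the hop counts; returns the figure."""
    pos = nx.spring_layout(G, seed=seed)
    fig, ax = plt.subplots()
    nx.draw(
        G, pos, ax=ax, nodelist=nodelist, node_color=colors,
        cmap=plt.cm.Blues, with_labels=True, font_size=6,
    )
    return fig
```

Calling draw_by_hops(G, nodelist, colors) and then plt.show() displays the result. Nodes that cannot be reached from firm1 do not appear in hops, so they are left out of nodelist and are not drawn.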
