Why does networkx reduce number of nodes after adding edges - python

I should start by saying that my code runs without any error messages, but I don't understand some of the results.
I create a graph in networkx from a pandas data frame that has 398595 integer IDs.
# Create Graph
G = nx.Graph()
G.name = "Graph from Pandas"
# Add Nodes to Graph
G.add_nodes_from(test_df['ID'].tolist())
print(nx.info(G))
The output from nx.info(G) is as follows, which is correct and what I expected:
Type: Graph
Number of nodes: 398595
Number of edges: 0
Average degree: 0.0000
Then I load a second pandas data frame, which contains 5556353 entries and has three columns:
ID1 ID2 weight
3 198 0.601002
3 183 0.618057
Each ID in ID1 or ID2 also exists in the first pandas data frame, so I load the edges as follows:
# Add data to Graph
G = nx.from_pandas_edgelist(df,source='ID1',target='ID2', edge_attr='weight')
print(nx.info(G))
However, here is what I don't understand. The output from nx.info(G) now returns:
Type: Graph
Number of nodes: 29348
Number of edges: 4371353
Average degree: 297.8978
Now my questions are (1) why are there fewer nodes in this graph than before and (2) why are there considerably fewer edges in this graph than there are rows in the data frame?

nx.from_pandas_edgelist returns a new graph built only from the edge list, so assigning its result to G replaces the graph you filled with add_nodes_from; any ID that never appears in ID1 or ID2 is no longer a node. There are probably fewer unique IDs across ID1 and ID2 of df than there are in the ID column of test_df. The first thing I would check is whether the number of unique IDs across ID1 and ID2 in df equals the number of nodes you display: len(pd.unique(df[['ID1','ID2']].values.ravel())) should equal 29348.
One reason there are fewer edges is that the data frame may list the same pair more than once, for example in both directions (ID1 to ID2 and ID2 to ID1). The default value for the create_using parameter of nx.from_pandas_edgelist is nx.Graph, so edges are treated as undirected and multiple edges between the same pair collapse into one. If you want directed edges, multiple edges, or both, try passing nx.DiGraph, nx.MultiGraph, or nx.MultiDiGraph respectively to the create_using parameter.
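To make the points above concrete, here is a minimal sketch. The two tiny data frames are made-up stand-ins for test_df and df from the question (the values are assumptions, not the asker's data). It builds the graph from the edge list first and then adds the full ID list, so IDs without any edge stay in the graph as isolated nodes.

```python
import networkx as nx
import pandas as pd

# Made-up stand-ins for the asker's data frames (assumed values).
test_df = pd.DataFrame({"ID": [3, 183, 198, 500, 501]})
df = pd.DataFrame({
    "ID1": [3, 3, 198, 3],
    "ID2": [198, 183, 3, 198],  # 198-3 and the second 3-198 repeat the first pair
    "weight": [0.601002, 0.618057, 0.601002, 0.7],
})

# Build the graph from the edge list, then add every known ID on top.
# add_nodes_from leaves existing nodes and their edges untouched.
G = nx.from_pandas_edgelist(df, source="ID1", target="ID2", edge_attr="weight")
G.add_nodes_from(test_df["ID"].tolist())

# Sanity check suggested in the answer: unique IDs that appear in any edge.
n_ids_in_edges = len(pd.unique(df[["ID1", "ID2"]].values.ravel()))

# Same edge list with graph classes that keep direction or parallel edges.
D = nx.from_pandas_edgelist(df, "ID1", "ID2", edge_attr="weight",
                            create_using=nx.DiGraph)
M = nx.from_pandas_edgelist(df, "ID1", "ID2", edge_attr="weight",
                            create_using=nx.MultiGraph)

print(G.number_of_nodes(), G.number_of_edges())
print(n_ids_in_edges, D.number_of_edges(), M.number_of_edges())
```

Comparing the edge counts of G, D and M on your own data should tell you whether the missing edges come from reversed pairs, from repeated rows, or from both.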

Related

Networkx Remove NaN Node in graph and its edges but keep the connected nodes

I want to visualize different sequences of data with Networkx. Not all sequences have a destination point, but I want to show these "single" sequences as a single circle in the plot. If they also appear in a different sequence (e.g. "A" also occurs in "AB"), then the "A" circle of the "AB" sequence should get bigger and there shouldn't be another "A" circle.
My raw input data looks like this:
Source Target
A B
A B
L (empty)
A (empty)
G C
M M
My desired output with these example sequences looks like this (ignoring colors):
My code looks like this so far. (Right now the size of a node depends on the number of edges it has; I also want to change it to how often a node appears in the data.)
import pandas as pd
import networkx as nx
from matplotlib import pyplot as plt
dataset = pd.read_csv(r'confidential_data.csv')
df = dataset.copy()
df = df.groupby(df.columns.tolist(),as_index=False, dropna = False).size()
df['size'] = df['size'].div(100).round(2) #dividing by 100 because the thickness of the lines would get too big
G=nx.from_pandas_edgelist(df, 'source', 'target', edge_attr='size', create_using=nx.DiGraph() )
d = dict(G.degree)
widths = nx.get_edge_attributes(G, 'size')
nodelist = G.nodes()
plt.figure(figsize=(12,8))
pos = nx.shell_layout(G)
nx.draw_networkx_nodes(G, pos,
                       nodelist=nodelist,
                       node_size=[v * 400 for v in d.values()],
                       node_color='orange',
                       alpha=0.7)
nx.draw_networkx_edges(G, pos=pos,
                       edgelist=widths.keys(),
                       width=list(widths.values()),
                       edge_color='blue',
                       alpha=0.6)
nx.draw_networkx_labels(G, pos=pos,
                        labels=dict(zip(nodelist, nodelist)),
                        font_color='black')
edge_labels = nx.get_edge_attributes(G, "size")
nx.draw_networkx_edge_labels(G, pos=pos, edge_labels=edge_labels, label_pos=0.2, font_size=15)
plt.box(False)
plt.show()
With this, my plot looks something like this for the example data:
Here you can see that the single occurrences of "A" and "L" in the table point to NaN. I don't want the connection to NaN. Is there a way either to remove the NaN node with all its edges (without affecting the other nodes) or to plot entries that have no target at all?
If you know a solution, or have a different kind of visualization in mind for this kind of data and problem, please let me know.
Thank you in advance!
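One possible approach, sketched here under assumptions rather than taken from an accepted answer: never create the NaN node in the first place. Rows without a target are dropped before building the edges, and every label is then added as a node so that "single" entries such as "L" still show up. The data frame below is a made-up copy of the example table, and the column names source and target follow the question's code.

```python
import networkx as nx
import pandas as pd

# Made-up copy of the example table; None marks a missing target.
df = pd.DataFrame({
    "source": ["A", "A", "L", "A", "G", "M"],
    "target": ["B", "B", None, None, "C", "M"],
})

# Only rows that actually have a target become edges.
edges = df.dropna(subset=["target"])
edge_counts = edges.groupby(["source", "target"], as_index=False).size()
G = nx.from_pandas_edgelist(edge_counts, "source", "target",
                            edge_attr="size", create_using=nx.DiGraph())

# Count how often each label appears anywhere in the data and make sure
# every label is a node, including those that never had a target.
occurrences = pd.concat([df["source"], df["target"]]).dropna().value_counts()
G.add_nodes_from(occurrences.index)

# Node sizes based on occurrences instead of degree (400 is the
# scaling factor used in the question).
node_sizes = [occurrences[n] * 400 for n in G.nodes()]
```

Passing node_size=node_sizes to nx.draw_networkx_nodes should then size each circle by how often its label appears, and "L" is drawn without any edge.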

Generate Network plot from large pandas dataframe

Suppose I have this dataframe df that contains 3794 rows x 2 columns, where column a_number represents nodes with directed edges to nodes in b_number:
a_number b_number
0 0123456789343 0123456789991
1 0123456789343 0123456789633
2 0123456789343 0123456789633
3 0123456789343 0123456789628
4 0123456789343 0123456789633
... ... ...
3789 0123456789697 0123456789916
3790 0123456789697 0123456789886
3791 0123456789697 0123456789572
3792 0123456789697 0123456789884
3793 0123456789697 0123456789125
3794 rows × 2 columns
Additional information:
len(df['a_number'].unique())
>>> 18
len(df['b_number'].unique())
>>>1145
I am trying to generate an image representation of the graph. Here's code to apply networkx:
import networkx as nx
G = nx.DiGraph()
for i, (x, y) in df.iterrows():
    G.add_node(x)
    G.add_node(y)
    G.add_edge(x, y)
nx.draw(G, with_labels = True, font_size=14 , node_size=2000)
I get this output:
I am having some problems visualizing the graphs created with python-networkx. I want to be able to reduce clutter and regulate the distance between the nodes. Please advise: what can I change in the code? Thank you.

First, to reduce the clutter, I would start by decreasing the node size, to maybe 200 or 400.
Try to reduce the font_size parameter in the draw function. This parameter regulates the size of the node labels. Since you have long node names, it will help reduce the clutter.
If having the labels on the graph is not necessary, remove them to make the plot cleaner by passing with_labels=False to the draw function.
Then, to regulate the distance between nodes, you can use the spring layout for the node positions.
pos = nx.spring_layout(G, k=0.8)
nx.draw(G, pos , with_labels = True, font_size=7, node_size=400)
The k parameter in the spring layout lets you regulate the distance between nodes. You can try different values to see what suits you best.
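As a small sketch of trying different k values, the snippet below computes two layouts for a made-up hub-and-leaves graph (an assumed stand-in for the a_number to b_number edges). The seed argument is only there to make the layout reproducible between runs.

```python
import networkx as nx

# Made-up graph: one hub with directed edges to ten leaves.
G = nx.DiGraph()
G.add_edges_from(("hub", f"leaf{i}") for i in range(10))

# spring_layout returns a dict mapping each node to an (x, y) position.
# k sets the optimal distance between nodes; seed fixes the random start.
pos_tight = nx.spring_layout(G, k=0.1, seed=42)
pos_loose = nx.spring_layout(G, k=0.8, seed=42)

print(pos_loose["hub"])
```

Either dict can be passed to the draw call, e.g. nx.draw(G, pos_loose, with_labels=True, font_size=7, node_size=400), to compare the results visually.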

Networkx: Multiple conditions for edges

I'm trying to generate a network through a dataframe like the following:
import itertools as it
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
df1 = pd.DataFrame({'id_emp': [1, 2, 3, 4, 5],
                    'roi': ['positive', 'negative', 'positive', 'negative', 'negative'],
                    'description': ['middle', 'low', 'middle', 'high', 'low']})
df1 = df1.set_index('id_emp')
In the network that I am trying to build, the nodes represent the values of the id_emp column, and there is an edge between two nodes if both the roi AND description column values are the same. Here is the code I'm using:
G = nx.Graph()
G.add_nodes_from([a for a in df1.index])
for cr in set(df1['roi']):
    indices = df1[df1['roi']==cr].index
    G.add_edges_from(it.product(indices, indices))
for d in set(df1['description']):
    indices = df1[df1['description']==d].index
    G.add_edges_from(it.product(indices, indices))
pos = nx.kamada_kawai_layout(G)
plt.figure(figsize=(3,3))
nx.draw(G,pos,node_size = 100, width = 0.5,with_labels=True)
plt.show()
Output:
Problem: Edges are being generated for nodes that have equal values in the description OR roi columns. In the given example, node 4 should have no connection because it has a different value in the description column.
What should I do to evaluate the two conditions together so that an edge only exists when both match?

I'm not sure why you're using a graph theory tool in this case. NetworkX would be interesting here if you wanted to find, for instance, the connected components (i.e. linked nodes).
However, if two nodes are only connected when they have exactly the same roi and description values, finding the groups is essentially the same as finding duplicate rows in the dataframe, which can be done with:
df1.roi.str.cat(df1.description, sep='-').reset_index().groupby('roi').id_emp.apply(list)
roi
negative-high [4]
negative-low [2, 5]
positive-middle [1, 3]
Name: id_emp, dtype: object
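If a graph is still wanted, the same grouping idea can be used to add edges only where both columns match. This is a sketch of one way to do it, not the answerer's code: it groups by both columns at once and connects every pair of nodes inside each group.

```python
import itertools as it

import networkx as nx
import pandas as pd

df1 = pd.DataFrame({'id_emp': [1, 2, 3, 4, 5],
                    'roi': ['positive', 'negative', 'positive', 'negative', 'negative'],
                    'description': ['middle', 'low', 'middle', 'high', 'low']})
df1 = df1.set_index('id_emp')

G = nx.Graph()
G.add_nodes_from(df1.index)

# Rows in the same group share BOTH roi and description.
for _, group in df1.groupby(['roi', 'description']):
    # combinations (not product) avoids self-loops and reversed duplicates.
    G.add_edges_from(it.combinations(group.index, 2))

print(sorted(G.edges()))
```

With the example data, node 4 ends up isolated because no other row shares both of its values.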

How to find nodes with Python string matching functions in Networkx?

Given a dependency parse graph, if I want to find the shortest path length between two fixed nodes, this is how I've coded it:
nx.shortest_path_length (graph, source='cost', target='20.4')
My question here is: What if I want to match for all sentences in the graph or collection a target with any number formatted approximately as a currency? Would I have to first find every node in the graph that is a currency, and then iterate over the set of currency values?
It would be ideal to have:
nx.shortest_path_length (graph, source='cost', target=r'^[$€£]?(\d+([\.,]00)?)$')
Or, from @bluepnume: ^[$€£]?((([1-5],?)?\d{2,3}|[5-9])(\.\d{2})?)$

You could do it in two steps, without having to loop over the currency nodes yourself.
Step 1: Calculate the shortest distance from your 'cost' node to all reachable nodes.
Step 2: Subset (using regex) just the currency nodes that you are interested in.
Here's an example to illustrate.
import networkx as nx
import matplotlib.pyplot as plt
import re
g = nx.DiGraph()
#create a dummy graph for illustration
g.add_edges_from([('cost', 'apples'), ('cost', 'of'),
                  ('$2', 'pears'), ('lemon', '£1.414'),
                  ('apples', '$2'), ('lemon', '£1.414'),
                  ('€3.5', 'lemon'), ('pears', '€3.5'),
                  ], distance=0.5) # using a list of edge tuples & specifying distance
g.add_edges_from([('€3.5', 'lemon'), ('of', '€3.5')],
                 distance=0.7)
nx.draw(g, with_labels=True)
which produces:
Now, you can calculate the shortest paths to your nodes of interest, subsetting using regex like you wanted to.
paths = nx.single_source_dijkstra_path(g, 'cost', weight='distance')
lengths = nx.single_source_dijkstra_path_length(g, 'cost', weight='distance')
currency_nodes = [n for n in lengths if re.search(r'[$€£]', n)]
[(n, dist) for (n, dist) in lengths.items() if n in currency_nodes]
produces:
[('$2', 1.0), ('€3.5', 1.2), ('£1.414', 2.4)]
Hope that helps you move forward.
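The two steps can also be wrapped in a small helper. This is only a sketch: the function name shortest_lengths_to_matching and the simplified currency pattern are assumptions made for illustration, and the graph is the dummy graph from the example above.

```python
import re

import networkx as nx


def shortest_lengths_to_matching(graph, source, pattern, weight="weight"):
    """Return {node: distance} for reachable nodes whose name matches pattern."""
    regex = re.compile(pattern)
    # Step 1: distances from source to every reachable node.
    lengths = nx.single_source_dijkstra_path_length(graph, source, weight=weight)
    # Step 2: keep only the nodes whose label matches the regex.
    return {node: dist for node, dist in lengths.items()
            if isinstance(node, str) and regex.search(node)}


# Same dummy graph as in the example above.
g = nx.DiGraph()
g.add_edges_from([('cost', 'apples'), ('cost', 'of'),
                  ('$2', 'pears'), ('lemon', '£1.414'),
                  ('apples', '$2'), ('€3.5', 'lemon'),
                  ('pears', '€3.5')], distance=0.5)
g.add_edges_from([('€3.5', 'lemon'), ('of', '€3.5')], distance=0.7)

# Simplified currency pattern (an assumption): symbol, digits, optional decimals.
currency = r'^[$€£]\d+([.,]\d+)?$'
result = shortest_lengths_to_matching(g, 'cost', currency, weight='distance')
print(result)
```

Any of the patterns from the question could be passed in place of the simplified one, as long as it matches the way currency amounts are written in your node labels.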

Is it possible to mix different shaped nodes in a networkx graph?

I have an XML file with two different types of node (let's call it node 'a' and node 'b'). They are connected. I want to present node 'a' as - let's say a square, and node 'b' as a circle. Is this possible in networkx and Python?
My original plan was to declare two graphs:
AG=nx.DiGraph()
BG=nx.DiGraph()
Then add nodes to each of these depending on the type of node: I would iterate through the XML and, if the node is a 'type a', add it to AG, and if it is a 'type b', add it to BG.
Now I can display each graph and define its shape, typically:
nx.draw_networkx(EG, font_family="Arial", font_size=10,
                 node_size=2000, node_shape="s", node_color='r', labels=node_labels, with_labels=True)
The part where it falls over is when I try to add edges between an 'a' node and a 'b' node.
Any ideas on how that might work?

Keep the entire dataset in one Graph:
G = nx.DiGraph()
Also keep lists of A_nodes, and B_nodes. Then you can select subgraphs of G using
AG = G.subgraph(A_nodes)
BG = G.subgraph(B_nodes)
Since all the nodes are in G, you can now add edges between A-nodes and B-nodes:
G.add_edge(a_node, b_node)
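To get the different shapes, one option is to draw each node list separately on a shared layout, since nx.draw_networkx_nodes accepts a nodelist and a node_shape. The sketch below uses made-up node names, and the helper draw_mixed_shapes is a hypothetical name; calling it requires matplotlib to be installed.

```python
import networkx as nx

# Made-up nodes standing in for the two XML node types.
A_nodes = ["a1", "a2"]
B_nodes = ["b1"]

G = nx.DiGraph()
G.add_nodes_from(A_nodes)
G.add_nodes_from(B_nodes)
G.add_edge("a1", "b1")  # edges between the two types live in G
G.add_edge("a2", "b1")


def draw_mixed_shapes(G, A_nodes, B_nodes):
    """Draw A-nodes as squares and B-nodes as circles on one shared layout."""
    pos = nx.circular_layout(G)  # one layout so both groups line up
    nx.draw_networkx_nodes(G, pos, nodelist=A_nodes, node_shape="s", node_color="r")
    nx.draw_networkx_nodes(G, pos, nodelist=B_nodes, node_shape="o", node_color="b")
    nx.draw_networkx_edges(G, pos)
    nx.draw_networkx_labels(G, pos)


# The subgraph of A-nodes alone contains no A-to-B edges.
AG = G.subgraph(A_nodes)
print(G.number_of_edges(), AG.number_of_edges())
```

After calling draw_mixed_shapes(G, A_nodes, B_nodes), a plt.show() from matplotlib.pyplot displays the figure.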
