I'm trying to generate a network from a dataframe like the following:
import pandas as pd
import networkx as nx
df1 = pd.DataFrame({'id_emp' : [1,2,3,4,5],
'roi': ['positive', 'negative', 'positive', 'negative', 'negative'],
'description': ['middle', 'low', 'middle', 'high', 'low']})
df1 = df1.set_index('id_emp')
In the network I am trying to build, the nodes are the values of the id_emp column, and there is an edge between two nodes if both their roi AND description values are the same. Here is the code I'm using:
import itertools as it
import matplotlib.pyplot as plt
G = nx.Graph()
G.add_nodes_from([a for a in df1.index])
# connect every pair of nodes that share a roi value
for cr in set(df1['roi']):
    indices = df1[df1['roi']==cr].index
    G.add_edges_from(it.product(indices, indices))
# connect every pair of nodes that share a description value
for d in set(df1['description']):
    indices = df1[df1['description']==d].index
    G.add_edges_from(it.product(indices,indices))
pos = nx.kamada_kawai_layout(G)
plt.figure(figsize=(3,3))
nx.draw(G,pos,node_size = 100, width = 0.5,with_labels=True)
plt.show()
Output:
Problem: Edges are being generated between nodes that share a value in the description OR the roi column. In the given example, node 4 should have no connections because no other node shares its value in the description column.
What should I do to check the two conditions together before adding an edge between two nodes?
I'm not sure why you're using a graph theory tool in such a case. NetworkX would be interesting here if you wanted to find the connected components, for instance (i.e. the groups of linked nodes).
However, if two nodes must have exactly the same roi and description values to be linked, finding those groups is essentially the same as listing the duplicate rows in the dataframe, which could be achieved by:
df1.roi.str.cat(df1.description, sep='-').reset_index().groupby('roi').id_emp.apply(list)
roi
negative-high [4]
negative-low [2, 5]
positive-middle [1, 3]
Name: id_emp, dtype: object
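If the graph itself is still wanted, the same grouping idea can be used to add edges only where both columns match. The following is a minimal sketch, assuming the df1 from the question; grouping on both columns at once and using itertools.combinations (instead of product) is a choice made here for illustration, not something from the original answer.

```python
import itertools as it

import networkx as nx
import pandas as pd

df1 = pd.DataFrame({'id_emp': [1, 2, 3, 4, 5],
                    'roi': ['positive', 'negative', 'positive', 'negative', 'negative'],
                    'description': ['middle', 'low', 'middle', 'high', 'low']})
df1 = df1.set_index('id_emp')

G = nx.Graph()
G.add_nodes_from(df1.index)

# Each group holds the ids that share BOTH roi and description.
for _, group in df1.groupby(['roi', 'description']):
    # combinations(..., 2) yields each unordered pair once and no self-loops.
    G.add_edges_from(it.combinations(group.index, 2))

print(sorted(G.edges()))
```

With this data, node 4 stays in the graph but has no edges, since it is alone in its group.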
Related
I have all my data stored in a geopandas geodataframe, let's call it df1.
I make a temporary copy df2
df2 = df1
I initiate a PointCloud() object, add the pointcloud and colours in the usual way
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(np.asarray(df2[['x', 'y', 'z']]))
pcd.colors = o3d.utility.Vector3dVector(np.asarray(df2[['red', 'green', 'blue']] / 65535))
I run statistical outlier removal
cl, ind = pcd.remove_statistical_outlier(nb_neighbors=nb, std_ratio=ratio)
I now want to add a column to the original df marking those that have been filtered out by the statistical outlier algorithm as False and those remaining as True. So I use the ind which is just an index list of the kept points.
df1['stat_filtered'] = np.where(df1.index.isin(ind), True, False)
To my mind this should add True or False to the df1['stat_filtered'] column relative to the index.
I can then repeat the process, this time filtering on the new column to select only those rows that are True.
df2 = df1.loc[(df1['stat_filtered'] == True)]
However, open3d works in numpy arrays, so once I have converted the dataframe to a pcd I imagine the original index is lost?
I could do a spatial join to reassign the true values I suppose, but that sounds computationally intensive and with large pointclouds there may be identical points which will cause issues. Is there any way to retain the original index so I can just update the dataframe's column in the manner I outline above?
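One way to think about this: the index list returned by remove_statistical_outlier refers to positions in the point array, and those positions follow the row order of the dataframe the array was built from. Below is a minimal sketch of mapping positions back to index labels, assuming the rows are not reordered between building the point cloud and applying the result. The `ind` list here is a hypothetical stand-in for the real return value, so open3d is not needed to run it.

```python
import numpy as np
import pandas as pd

# Toy dataframe with a non-default index, standing in for the geodataframe.
df1 = pd.DataFrame({'x': [0.0, 1.0, 2.0, 3.0, 4.0],
                    'y': [0.0, 1.0, 2.0, 3.0, 4.0],
                    'z': [0.0, 0.1, 9.0, 0.2, 8.0]},
                   index=[10, 11, 12, 13, 14])

# The (possibly already filtered) frame that was converted to a point cloud.
df2 = df1.copy()

# Hypothetical stand-in for the `ind` returned by remove_statistical_outlier:
# positions of the points that were kept, relative to df2's row order.
ind = [0, 1, 3]

# Translate positions into df2's index labels, then mark them in df1.
kept_labels = df2.index[ind]
df1['stat_filtered'] = df1.index.isin(kept_labels)

print(df1['stat_filtered'].tolist())
```

Because the lookup goes through df2.index, the same two lines should keep working on later passes where df2 is only a filtered subset of df1.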
I want to visualize different sequences of data with NetworkX. Not all sequences have a destination point, but I want to show these "single" sequences as a single circle in the plot. If they also appear in a different sequence (e.g. "A" also occurs in "AB"), then the "A" circle of the "AB" sequence should get bigger and no additional "A" circle should appear.
My raw input data looks like this:
Source  Target
A       B
A       B
L
A
G       C
M       M
My desired output with these example sequences looks like this (ignoring colors):
My code looks like this so far. (Right now the size of a node depends on the number of edges it has; I also want to change it to depend on how often a node appears in the data.)
import pandas as pd
import networkx as nx
from matplotlib import pyplot as plt
dataset = pd.read_csv(r'confidential_data.csv')
df = dataset.copy()
df = df.groupby(df.columns.tolist(),as_index=False, dropna = False).size()
df['size'] = df['size'].div(100).round(2) #dividing by 100 because the thickness of the lines would get too big
G=nx.from_pandas_edgelist(df, 'source', 'target', edge_attr='size', create_using=nx.DiGraph() )
d = dict(G.degree)
widths = nx.get_edge_attributes(G, 'size')
nodelist = G.nodes()
plt.figure(figsize=(12,8))
pos = nx.shell_layout(G)
nx.draw_networkx_nodes(G, pos,
                       nodelist=nodelist,
                       node_size=[v * 400 for v in d.values()],
                       node_color='orange',
                       alpha=0.7)
nx.draw_networkx_edges(G, pos=pos,
                       edgelist=widths.keys(),
                       width=list(widths.values()),
                       edge_color='blue',
                       alpha=0.6)
nx.draw_networkx_labels(G, pos=pos,
                        labels=dict(zip(nodelist, nodelist)),
                        font_color='black')
edge_labels = nx.get_edge_attributes(G, "size")
nx.draw_networkx_edge_labels(G, pos=pos, edge_labels=edge_labels, label_pos=0.2, font_size=15)
plt.box(False)
plt.show()
With this, my plot would look something like this with the example data:
Here you can see that the single occurrences "A" and "L" in the table are connected to NaN. I don't want the connection to NaN. Is there a way either to remove the NaN node with all its edges (without affecting the other nodes) or to plot data that has no target point at all?
If you know a solution or maybe have a different kind of visualization in mind for this kind of data and problem, please let me know.
Thank you in advance!
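One possible approach is to build the edges only from rows that have a target, then add every source as a node so that sources without a target still show up as isolated circles. The sketch below assumes the example table above with lowercase column names (as in the code) and sizes nodes by how often each label occurs in either column; the 400 scaling factor is carried over from the question.

```python
import networkx as nx
import pandas as pd

# Example data from the table; None marks a missing target.
df = pd.DataFrame({
    'source': ['A', 'A', 'L', 'A', 'G', 'M'],
    'target': ['B', 'B', None, None, 'C', 'M'],
})

# Count identical (source, target) rows, ignoring rows without a target.
edges = (df.dropna(subset=['target'])
           .groupby(['source', 'target'], as_index=False)
           .size())

G = nx.from_pandas_edgelist(edges, 'source', 'target',
                            edge_attr='size', create_using=nx.DiGraph())

# Sources without a target become isolated nodes instead of pointing at NaN.
G.add_nodes_from(df['source'])

# How often each label appears anywhere in the data.
occurrences = pd.concat([df['source'], df['target']]).dropna().value_counts()
node_sizes = [occurrences[n] * 400 for n in G.nodes()]

print(sorted(G.nodes()), dict(occurrences))
```

The node_sizes list follows the order of G.nodes(), so it can be passed as node_size to nx.draw_networkx_nodes together with nodelist=list(G.nodes()).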
I am new to the NetworkX package in Python. I want to solve the following problem.
Let's say this is my data set:
import pandas as pd
d = {'label': [1, 2, 3, 4, 5], 'size': [10, 8, 6, 4, 2], 'dist': [0, 2, -2, 4, -4]}
df = pd.DataFrame(data=d)
df
label and size in the df are quite self-explanatory. The dist column measures the distance from the biggest label (label 1) to the rest of the labels. Hence dist is 0 in the case of label 1.
I want to produce something similar to the picture below:
The label with the biggest size (label 1) is in a central position, the edges represent the distance from label 1 to each of the other labels, and the size of each node is proportional to the size of its label. Is it possible?
Thank you very much in advance. Please let me know if the question is unclear.
import matplotlib.pyplot as plt
import networkx as nx
G = nx.Graph()
# one node per label, placed on a horizontal line at x = dist
for _, row in df.iterrows():
    G.add_node(row['label'], pos=(row['dist'], 0), size=row['size'])
biggest_node = 1
# connect the central node to every other node
for node in G.nodes:
    if node != biggest_node:
        G.add_edge(biggest_node, node)
nx.draw(G,
        pos={node: attrs['pos'] for node, attrs in G.nodes.items()},
        node_size=[node['size'] * 100 for node in G.nodes.values()],
        with_labels=True
        )
plt.show()
Which plots
Notes:
You will notice that the edges 1-3 and 1-2 are thicker, because they overlap with sections of the edges 1-5 and 1-4 respectively. You can address that by drawing only one edge from the center to the furthest node in each direction; since every node is on the same line, the result will look the same.
coords = [(attrs['pos'][0], node) for node, attrs in G.nodes.items()]
nx.draw(G,
        # same arguments as before and also add
        edgelist=[(biggest_node, min(coords)[1]), (biggest_node, max(coords)[1])]
        )
The 100 factor in the list for the node_size argument is just a scaling factor. You can change that to whatever you want.
I need to start this by saying that my code runs without any error messages, but I don't understand some of the results.
I create a graph in networkx from a pandas data frame, that has 398595 integer IDs.
# Create Graph
G = nx.Graph()
G.name = "Graph from Pandas"
# Add Nodes to Graph
G.add_nodes_from(test_df['ID'].tolist())
print(nx.info(G))
The output from nx.info(G) is as follows, which is correct and is what I expected:
Type: Graph
Number of nodes: 398595
Number of edges: 0
Average degree: 0.0000
Then I load a second pandas data frame and it contains 5556353 entries and has three columns:
ID1 ID2 weight
3 198 0.601002
3 183 0.618057
Each ID in ID1 or ID2 also exists in the first pandas dataframe, so I load the edges as follows:
# Add data to Graph
G = nx.from_pandas_edgelist(df,source='ID1',target='ID2', edge_attr='weight')
print(nx.info(G))
However, here is what I don't understand. The output from nx.info(G) now returns:
Type: Graph
Number of nodes: 29348
Number of edges: 4371353
Average degree: 297.8978
Now my questions are (1) why are there fewer nodes in this graph than before and (2) why are there considerably fewer edges in this Graph than available from the data frame?
There are probably fewer unique IDs across ID1 and ID2 of df than there are in the ID column of test_df. Note that nx.from_pandas_edgelist builds a new graph, and assigning its result to G replaces the graph you populated earlier, so any ID that never appears in an edge is no longer a node. The first thing I would check is whether the number of unique IDs across ID1 and ID2 in df equals the number of nodes reported: len(pd.unique(df[['ID1','ID2']].values.ravel())) should equal 29348.
One reason there can be fewer edges is that the dataframe contains the same pair in both directions (a row (a, b) and a row (b, a)) or repeated pairs. The default value for the create_using parameter of nx.from_pandas_edgelist is nx.Graph, so edges are treated as undirected and multiple edges between the same two nodes are collapsed into one. If you want directed edges, multiple edges, or both, try passing nx.DiGraph, nx.MultiGraph, or nx.MultiDiGraph respectively to the create_using parameter.
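Both effects can be reproduced on a toy example. The sketch below uses small made-up dataframes (not the asker's data) to show how to keep the pre-added nodes by adding edges to the existing graph instead of replacing it, and how an undirected Graph collapses reversed pairs while a MultiDiGraph keeps every row.

```python
import networkx as nx
import pandas as pd

# Made-up stand-ins for test_df and df.
test_df = pd.DataFrame({'ID': [1, 2, 3, 4, 5]})
df = pd.DataFrame({'ID1': [1, 2, 1],
                   'ID2': [2, 1, 3],
                   'weight': [0.5, 0.6, 0.7]})

# Keep the original nodes: add edges to the existing graph.
G = nx.Graph()
G.add_nodes_from(test_df['ID'].tolist())
G.add_weighted_edges_from(
    df[['ID1', 'ID2', 'weight']].itertuples(index=False, name=None))

# Every row kept as its own directed edge.
M = nx.from_pandas_edgelist(df, source='ID1', target='ID2',
                            edge_attr='weight', create_using=nx.MultiDiGraph)

n_unique = len(pd.unique(df[['ID1', 'ID2']].values.ravel()))
print(G.number_of_nodes(), G.number_of_edges(), M.number_of_edges(), n_unique)
```

In the undirected graph the rows (1, 2) and (2, 1) end up as a single edge, and the weight of whichever row is added last overwrites the earlier one.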
Given a dependency parse graph, if I want to find the shortest path length between two fixed nodes, this is how I've coded it:
nx.shortest_path_length (graph, source='cost', target='20.4')
My question here is: What if I want to match for all sentences in the graph or collection a target with any number formatted approximately as a currency? Would I have to first find every node in the graph that is a currency, and then iterate over the set of currency values?
It would be ideal to have:
nx.shortest_path_length (graph, source='cost', target=r'^[$€£]?(\d+([\.,]00)?)$')
Or from @bluepnume: ^[$€£]?((([1-5],?)?\d{2,3}|[5-9])(\.\d{2})?)$
You could do it in two steps, without having to loop over.
Step 1: Calculate the shortest distance from your 'cost' node to all reachable nodes.
Step 2: Subset (using regex) just the currency nodes that you are interested in.
Here's an example to illustrate.
import networkx as nx
import matplotlib.pyplot as plt
import re
g = nx.DiGraph()
#create a dummy graph for illustration
g.add_edges_from([('cost','apples'),('cost', 'of'),
                  ('$2', 'pears'),('lemon', '£1.414'),
                  ('apples', '$2'),('lemon', '£1.414'),
                  ('€3.5', 'lemon'),('pears', '€3.5'),
                  ], distance=0.5) # using a list of edge tuples & specifying distance
g.add_edges_from([('€3.5', 'lemon'),('of', '€3.5')],
                 distance=0.7)
nx.draw(g, with_labels=True)
which produces:
Now, you can calculate the shortest paths to your nodes of interest, subsetting using regex like you wanted to.
paths = nx.single_source_dijkstra_path(g, 'cost', weight='distance')
lengths = nx.single_source_dijkstra_path_length(g, 'cost', weight='distance')
# raw string so the regex is not parsed for string escapes; keep nodes containing a currency symbol
currency_nodes = [n for n in lengths if re.search(r'[$€£]', n)]
[(n, dist) for (n, dist) in lengths.items() if n in currency_nodes]
produces:
[('$2', 1.0), ('€3.5', 1.2), ('£1.414', 2.4)]
Hope that helps you move forward.
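The two steps can also be wrapped in a small helper that takes the pattern as an argument, which is close to the call the question asks for. This is only a sketch: the function name lengths_to_matching_nodes and the CURRENCY pattern are hypothetical choices made for illustration, and the pattern is deliberately looser than the ones in the question.

```python
import re

import networkx as nx


def lengths_to_matching_nodes(graph, source, pattern, weight='distance'):
    """Map each reachable node whose label fully matches `pattern`
    to its shortest path length from `source`."""
    lengths = nx.single_source_dijkstra_path_length(graph, source, weight=weight)
    regex = re.compile(pattern)
    return {n: d for n, d in lengths.items()
            if isinstance(n, str) and regex.fullmatch(n)}


# Small illustrative graph, not a real dependency parse.
g = nx.DiGraph()
g.add_edge('cost', 'apples', distance=0.5)
g.add_edge('apples', '$2', distance=0.5)
g.add_edge('cost', 'of', distance=0.5)
g.add_edge('of', '€3.5', distance=0.7)

# Hypothetical pattern: optional currency symbol, digits, optional decimals.
CURRENCY = r'[$€£]?\d+(?:[.,]\d+)?'
result = lengths_to_matching_nodes(g, 'cost', CURRENCY)
print(result)
```

Nodes that cannot be reached from the source are simply absent from the result, because the Dijkstra call only returns reachable nodes.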