I have a dataframe with 2 columns: "emp" is the child column and "man" is the parent column. I need to count the total number of children (direct and indirect) for any given parent.
emp          man
23ank(5*)    213raj(11*)
55man(5*)    213raj(11*)
2shu(1*)     23ank(5*)
7am(3*)      55man(5*)
9shi(0*)     55man(5*)
213raj(11*)  66sam(13*)
For example, if I request the details for 213raj(11*), I expect:
213raj(11*), 23ank(5*), 2shu(1*), 55man(5*), 7am(3*), 9shi(0*)
and the total count for 213raj(11*) = 5.
If I consider 66sam(13*), then:
66sam(13*), 213raj(11*), 23ank(5*), 2shu(1*), 55man(5*), 7am(3*), 9shi(0*)
and the total count for 66sam(13*) = 6.
I tried the code below but am not getting the required results:
kv = kvpp[['emp', 'man']]
kvp = dict(zip(kv.emp, kv.man))

parents = set()
children = {}
for c, p in kvp.items():
    parents.add(p)
    children[c] = p

def ancestors(p):
    return (ancestors(children[p]) if p in children else []) + [p]

pp = []
for k in (set(children.keys()) - parents):
    pp.append('/'.join(ancestors(k)))
In graph theory terms, you have an edge list describing a directed acyclic graph.
Here's a solution using the NetworkX graph theory library.
import networkx as nx
emp_to_man = [
    ('23ank(5*)', '213raj(11*)'),
    ('55man(5*)', '213raj(11*)'),
    ('2shu(1*)', '23ank(5*)'),
    ('7am(3*)', '55man(5*)'),
    ('9shi(0*)', '55man(5*)'),
    ('213raj(11*)', '66sam(13*)'),
]

# Create a directed graph from the edge list.
# Converting a 2-column DF into a digraph is as easy as
# `nx.DiGraph(list(df.values))`.
g = nx.DiGraph(emp_to_man)

for emp in sorted(g):  # For every employee (in sorted order for tidiness),
    # ... print the set of ancestors (in no particular order).
    # Should the edges run man-to-emp instead, you'd use `nx.descendants`.
    print(emp, nx.ancestors(g, emp))
This prints out:
213raj(11*) {'55man(5*)', '7am(3*)', '2shu(1*)', '9shi(0*)', '23ank(5*)'}
23ank(5*) {'2shu(1*)'}
2shu(1*) set()
55man(5*) {'9shi(0*)', '7am(3*)'}
66sam(13*) {'213raj(11*)', '55man(5*)', '7am(3*)', '9shi(0*)', '2shu(1*)', '23ank(5*)'}
7am(3*) set()
9shi(0*) set()
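Since the question ultimately wants counts, the size of each ancestor set gives them directly (a one-line follow-up on the graph g built above):

print(len(nx.ancestors(g, '213raj(11*)')))  # 5
print(len(nx.ancestors(g, '66sam(13*)')))   # 6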
EDIT: In case performance is paramount, I'd heartily suggest the NetworkX approach. Based on a quick timeit test, finding all the employees is roughly 62 times faster than the Pandas-based code, and that's while converting the DF into a NetworkX graph on every invocation.
EDIT 2: To my rather great surprise, a naïve set/defaultdict graph traversal is faster still -- 387 times faster than the Pandas code and 5 times faster than the NetworkX code above.
import collections

def dag_count_all_children():
    # Build manager -> set-of-direct-reports from the (emp, man) rows.
    dag = collections.defaultdict(set)
    for emp, man in df.values:
        dag[man].add(emp)
    out = {}
    for man in set(dag):
        found = set()
        frontier = {man}
        while frontier:
            emp = frontier.pop()
            frontier.update(dag[emp] - found)
            found.update(dag[emp])
        out[man] = found
    return out
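A small usage sketch (assuming df is the two-column dataframe from the question): the counts fall out as the sizes of the returned sets.

out = dag_count_all_children()
print(len(out['213raj(11*)']))  # 5
print(len(out['66sam(13*)']))   # 6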
If I've understood your question correctly, this function should give you the correct answers:
import pandas as pd

df = pd.DataFrame({'emp': ['23ank(5*)', '55man(5*)', '2shu(1*)', '7am(3*)', '9shi(0*)', '213raj(11*)'],
                   'man': ['213raj(11*)', '213raj(11*)', '23ank(5*)', '55man(5*)', '55man(5*)', '66sam(13*)']})

def count_children(parent):
    total_children = []  # initialise list of children to append to
    direct = df[df['man'] == parent]['emp'].to_list()
    total_children += direct  # add direct children
    indirect = df[df['man'].isin(direct)]['emp'].to_list()
    total_children += indirect  # add indirect children
    # next, add children of indirect children in a loop
    next_indirect = indirect
    while True:
        next_indirect = df[df['man'].isin(next_indirect)]['emp'].to_list()
        if not next_indirect or all(i in total_children for i in next_indirect):
            break
        else:
            total_children = list(set(next_indirect).union(set(total_children)))
    count = len(total_children)
    return pd.DataFrame({'count': count,
                         'children': ','.join(total_children)},
                        index=[parent])
count_children('213raj(11*)') -> 5
count_children('66sam(13*)') -> 6
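If you need the table for every manager at once, the per-parent frames concatenate cleanly (a small usage sketch on the same df):

all_counts = pd.concat(count_children(m) for m in df['man'].unique())
print(all_counts)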
I am trying to implement the Chinese Whispers algorithm, but I cannot figure out the issue with the code below. The result I want to get is like the one in the picture below.
import networkx as nx
from random import shuffle as shuffle

# build node and edge lists
nodes = [
    (1, {'attr1': 1}),
    (2, {'attr1': 1})
]
edges = [
    (1, 2, {'weight': 0.732})
]

# initialize the graph
G = nx.Graph()

# Add nodes
G.add_nodes_from(nodes)

# CW needs an arbitrary, unique class for each node before initialisation
# Here I use the ID of the node since I know it's unique
# You could use a random number or a counter or anything really
for n, v in enumerate(nodes):
    G.node[v[0]]["class"] = v[1]["attr1"]

# add edges
G.add_edges_from(edges)

# run Chinese Whispers
# I default to 10 iterations. This number is usually low.
# After a certain number (individual to the data set) no further clustering occurs
iterations = 10
for z in range(0, iterations):
    gn = G.nodes()
    # I randomize the nodes to give me an arbitrary start point
    shuffle(gn)
    for node in gn:
        neighs = G[node]
        classes = {}
        # do an inventory of the given node's neighbours and edge weights
        for ne in neighs:
            if isinstance(ne, int):
                if G.node[ne]['class'] in classes:
                    classes[G.node[ne]['class']] += G[node][ne]['weight']
                else:
                    classes[G.node[ne]['class']] = G[node][ne]['weight']
        # find the class with the highest edge weight sum
        max = 0
        maxclass = 0
        for c in classes:
            if classes[c] > max:
                max = classes[c]
                maxclass = c
        # set the class of target node to the winning local class
        G.node[node]['class'] = maxclass
I want to produce the output shown below:
[image of the expected clustering, not reproduced here]
As noted in the answer by @Shridhar R Kulkarni, you want to use G.nodes when referencing nodes for updating/adding attributes.
Another problem with the script is that for shuffling, you want only the list of node identifiers:
gn = list(G.nodes()) # use list() to allow shuffling
shuffle(gn)
If you are interested in just the computation, rather than (re)implementation, you could also use the existing chinese-whispers library.
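A minimal sketch of that route, assuming the chinese_whispers package exposes the chinese_whispers() and aggregate_clusters() helpers described in its README:

import networkx as nx
from chinese_whispers import chinese_whispers, aggregate_clusters

G = nx.Graph()
G.add_edge(1, 2, weight=0.732)

chinese_whispers(G, weighting='top', iterations=20)  # writes a 'label' attribute on each node
for label, cluster in sorted(aggregate_clusters(G).items()):
    print(label, cluster)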
Use G.nodes instead of G.node.
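In practice that means the two attribute assignments in the question's code become (only the attribute name changes; everything else stays as is):

G.nodes[v[0]]["class"] = v[1]["attr1"]  # was: G.node[v[0]]["class"]
G.nodes[node]['class'] = maxclass       # was: G.node[node]['class']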
I am trying to generate a D-ary balanced tree in python using the networkx package.
import networkx as nx

g = nx.Graph()
D = int(input("enter number of children of a node:"))
L = int(input("Enter the number of levels:"))

# variable to store the total number of nodes in the tree
tot_node = 0
for i in range(0, L + 1):
    tot_node = tot_node + D**i

for N in range(1, tot_node):
    for j in range(N, N + D):
        g.add_edge(N, j)

nx.draw(g)
For this I am getting the following tree for D=2 and L=3:
[image of the incorrect output, not reproduced here]
Can someone please point out the error in this code? I want to construct a balanced tree for any general D (the number of branches of a node).
I have updated the code again to make sure the general cases work. I hope I have not made this more complicated than necessary; I feel there must be a simpler implementation, maybe one that relies on recursion.
Anyway, I have produced what I think is an acceptable result. Although it is not your code directly, I believe I have implemented something along the lines of the rudimentary solution you want:
import matplotlib.pyplot as plt
import networkx as nx

# We make a Node class to track which node to modify (modify here means add children to).
class Node:
    def __init__(self, node_id, has_children, not_connected):
        self.node_id = node_id
        self.has_children = has_children
        self.not_connected = not_connected

def get_min_not_connected(nodes_tracker):
    smallest = float('inf')
    for node in nodes_tracker:
        # print(f"Is the node {node.node_id} not connected: {node.not_connected}")
        if node.node_id < smallest and node.not_connected:
            smallest = node.node_id
    return smallest - 1

def construction_step(G, node_id, num_children, nodes_tracker):
    # print(f"The range is {len(nodes_tracker)+1} to {len(nodes_tracker)+num_children+1}")
    # Create new Node objects to track which connections have been made.
    # Note how the third parameter (not_connected) is True.
    nodes_tracker = nodes_tracker + [Node(i, False, True) for i in range(len(nodes_tracker) + 1, len(nodes_tracker) + num_children + 1)]
    for i in range(1, num_children + 1):
        print(f'adding edge relation ({node_id}, {get_min_not_connected(nodes_tracker) + i})')
        # Here I am adding the child nodes to the parent ones.
        G.add_edge(node_id, get_min_not_connected(nodes_tracker) + i)
    for i in range(1, num_children + 1):
        # print(get_min_not_connected(nodes_tracker))
        nodes_tracker[get_min_not_connected(nodes_tracker)].not_connected = False
    return nodes_tracker

# Hardcode inputs for your specific example.
# I am using num_children in place of your D variable.
num_children = 3
L = 2

G = nx.Graph()

# Create the central (initial) node and setup.
total_nodes = 0
# Number of internal (non-leaf) nodes: D^0 + D^1 + ... + D^(L-1).
for i in range(0, L):
    total_nodes += num_children**i
print(total_nodes)

nodes_tracker = [Node(1, False, False)]

# Create the actual D-ary graph here.
for i in range(1, total_nodes + 1):
    nodes_tracker = construction_step(G, i, num_children, nodes_tracker)
# print(len(nodes_tracker))

nx.draw(G)
plt.show()
For the output with your parameters D=2, L=3, as well as the more general cases D=4, L=2 and, for fun, D=5, L=3, I got the expected balanced trees (images not reproduced here). It works with bigger D and L as well, but the charts naturally look very ugly.
Thanks for your patience with this answer and I hope this helps.
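As a footnote: if a built-in generator is acceptable, NetworkX ships nx.balanced_tree, which constructs a perfectly balanced D-ary tree of a given height directly, so the whole example collapses to a few lines:

import matplotlib.pyplot as plt
import networkx as nx

G = nx.balanced_tree(2, 3)  # branching factor D=2, height L=3
nx.draw(G)
plt.show()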
The idea is to compute the resilience of a network represented as an undirected graph in the form
{node: (set of its neighbors) for each node in the graph}.
The function removes nodes from the graph in random order, one by one, and calculates the size of the largest remaining connected component.
The helper function bfs_visited() returns the set of nodes that are still connected to the given node.
How can I improve the implementation of the algorithm in Python 2, preferably without changing the breadth-first algorithm in the helper function?
from collections import deque

def bfs_visited(graph, node):
    """undirected graph {Vertex: {neighbors}}
    Returns the set of all nodes visited by the algorithm"""
    queue = deque()
    queue.append(node)
    visited = set([node])
    while queue:
        current_node = queue.popleft()
        for neighbor in graph[current_node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return visited

def cc_visited(graph):
    """undirected graph {Vertex: {neighbors}}
    Returns a list of sets of connected components"""
    remaining_nodes = set(graph.keys())
    connected_components = []
    for node in remaining_nodes:
        visited = bfs_visited(graph, node)
        if visited not in connected_components:
            connected_components.append(visited)
        remaining_nodes = remaining_nodes - visited
        # print(node, remaining_nodes)
    return connected_components

def largest_cc_size(ugraph):
    """Returns the size (an integer) of the largest connected component
    in the ugraph."""
    if not ugraph:
        return 0
    res = [(len(ccc), ccc) for ccc in cc_visited(ugraph)]
    res.sort()
    return res[-1][0]

def compute_resilience(ugraph, attack_order):
    """
    input: a graph {V: N}
    returns a list whose k+1th entry is the size of the largest cc after
    the removal of the first k nodes
    """
    res = [len(ugraph)]
    for node in attack_order:
        neighbors = ugraph[node]
        for neighbor in neighbors:
            ugraph[neighbor].remove(node)
        ugraph.pop(node)
        res.append(largest_cc_size(ugraph))
    return res
I received this tremendously great answer from Gareth Rees, which covers the question completely.
Review
The docstring for bfs_visited should explain the node argument.
The docstring for compute_resilience should explain that the ugraph argument gets modified. Alternatively, the function could take a copy of the graph so that the original is not modified.
In bfs_visited the lines:
queue = deque()
queue.append(node)
can be simplified to:
queue = deque([node])
The function largest_cc_size builds a list of pairs:
res = [(len(ccc), ccc) for ccc in cc_visited(ugraph)]
res.sort()
return res[-1][0]
But you can see that it only ever uses the first element of each pair (the size of the component). So you could simplify it by not building the pairs:
res = [len(ccc) for ccc in cc_visited(ugraph)]
res.sort()
return res[-1]
Since only the size of the largest component is needed, there is no need to build the whole list. Instead you could use max to find the largest:
if ugraph:
    return max(map(len, cc_visited(ugraph)))
else:
    return 0
If you are using Python 3.4 or later, this can be further simplified using the default argument to max:
return max(map(len, cc_visited(ugraph)), default=0)
This is now so simple that it probably doesn't need its own function.
This line:
remaining_nodes = set(graph.keys())
can be written more simply:
remaining_nodes = set(graph)
There is a loop over the set remaining_nodes where on each loop iteration you update remaining_nodes:
for node in remaining_nodes:
    visited = bfs_visited(graph, node)
    if visited not in connected_components:
        connected_components.append(visited)
    remaining_nodes = remaining_nodes - visited
It looks as if the intention of the code is to avoid iterating over the nodes in visited by removing them from remaining_nodes, but this doesn't work! The problem is that the for statement:
for node in remaining_nodes:
only evaluates the expression remaining_nodes once, at the start of the loop. So when the code creates a new set and assigns it to remaining_nodes:
remaining_nodes = remaining_nodes - visited
this has no effect on the nodes being iterated over.
You might imagine trying to fix this by using the difference_update method to adjust the set being iterated over:
remaining_nodes.difference_update(visited)
but this would be a bad idea because then you would be iterating over a set and modifying it within the loop, which is not safe. Instead, you need to write the loop as follows:
while remaining_nodes:
    node = remaining_nodes.pop()
    visited = bfs_visited(graph, node)
    if visited not in connected_components:
        connected_components.append(visited)
    remaining_nodes.difference_update(visited)
Using while and pop is the standard idiom in Python for consuming a data structure while modifying it — you do something similar in bfs_visited.
There is now no need for the test:
if visited not in connected_components:
since each component is produced exactly once.
In compute_resilience the first line is:
res = [len(ugraph)]
but this only works if the graph is a single connected component to start with. To handle the general case, the first line should be:
res = [largest_cc_size(ugraph)]
For each node in attack order, compute_resilience calls:
res.append(largest_cc_size(ugraph))
But this doesn't take advantage of the work that was previously done. When we remove node from the graph, all connected components remain the same, except for the connected component containing node. So we can potentially save some work if we only do a breadth-first search over that component, and not over the whole graph. (Whether this actually saves any work depends on how resilient the graph is. For highly resilient graphs it won't make much difference.)
In order to do this we'll need to redesign the data structures so that we can efficiently find the component containing a node, and efficiently remove that component from the collection of components.
This answer is already quite long, so I won't explain in detail how to redesign the data structures, I'll just present the revised code and let you figure it out for yourself.
def connected_components(graph, nodes):
    """Given an undirected graph represented as a mapping from nodes to
    the set of their neighbours, and a set of nodes, find the
    connected components in the graph containing those nodes.

    Returns:
    - mapping from nodes to the canonical node of the connected
      component they belong to
    - mapping from canonical nodes to connected components
    """
    canonical = {}
    components = {}
    while nodes:
        node = nodes.pop()
        component = bfs_visited(graph, node)
        components[node] = component
        nodes.difference_update(component)
        for n in component:
            canonical[n] = node
    return canonical, components
def resilience(graph, attack_order):
    """Given an undirected graph represented as a mapping from nodes to
    an iterable of their neighbours, and an iterable of nodes, generate
    integers such that the k-th result is the size of the largest
    connected component after the removal of the first k-1 nodes.
    """
    # Take a copy of the graph so that we can destructively modify it.
    graph = {node: set(neighbours) for node, neighbours in graph.items()}

    canonical, components = connected_components(graph, set(graph))
    largest = lambda: max(map(len, components.values()), default=0)
    yield largest()
    for node in attack_order:
        # Find the connected component containing node.
        component = components.pop(canonical.pop(node))

        # Remove node from the graph.
        for neighbor in graph[node]:
            graph[neighbor].remove(node)
        graph.pop(node)
        component.remove(node)

        # The component may have been split by the removal of node, so
        # search it for new connected components and update the data
        # structures accordingly.
        canon, comp = connected_components(graph, component)
        canonical.update(canon)
        components.update(comp)
        yield largest()
In the revised code, the max operation has to iterate over all the remaining connected components in order to find the largest one. It would be possible to improve the efficiency of this step by storing the connected components in a priority queue so that the largest one can be found in time that's logarithmic in the number of components.
I doubt that this part of the algorithm is a bottleneck in practice, so it's probably not worth the extra code, but if you need to do this, then there are some Priority Queue Implementation Notes in the Python documentation.
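For the curious, here is a minimal sketch of that idea (not wired into the code above; push_component and largest_size are hypothetical helper names): a max-heap of (-size, canonical_node) entries with lazy deletion, so stale entries are discarded only when they surface at the top. A component that vanishes or is re-split is simply dropped from (or updated in) the live map.

import heapq

heap = []  # entries: (-size, canonical_node)
live = {}  # canonical_node -> current component size

def push_component(canon, size):
    live[canon] = size
    heapq.heappush(heap, (-size, canon))

def largest_size():
    while heap:
        neg_size, canon = heap[0]
        if live.get(canon) == -neg_size:  # entry is still current
            return -neg_size
        heapq.heappop(heap)  # stale entry: component was removed or re-split
    return 0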
Performance comparison
Here's a useful function for making test cases:
from itertools import combinations
from random import random

def random_graph(n, p):
    """Return a random undirected graph with n nodes and each edge chosen
    independently with probability p.
    """
    assert 0 <= p <= 1
    graph = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if random() <= p:
            graph[i].add(j)
            graph[j].add(i)
    return graph
Now, a quick performance comparison between the revised and original code. Note that we have to run the revised code first, because the original code destructively modifies the graph, as noted in the review above.
>>> from timeit import timeit
>>> G = random_graph(300, 0.2)
>>> timeit(lambda:list(resilience(G, list(G))), number=1) # revised
0.28782312001567334
>>> timeit(lambda:compute_resilience(G, list(G)), number=1) # original
59.46968446299434
So the revised code is about 200 times faster on this test case.
I have a bit of a logical challenge. I have a single table in Excel that contains an identifier column and a cross-reference column. There can be multiple rows for a single identifier, which indicates multiple cross references (see the basic example data below).
Any record that ends in the letter "X" indicates that it is a cross reference, not an actual identifier. I need to generate a list of the cross references for each identifier, but trace each one down to the actual cross-referenced identifier. So using "A1" as an example, I would need the list returned as "A2, A3, B1, B3". Notice there are no identifiers ending in "X" in the list; they have been traced down to the actual source records through the table.
Any ideas or help would be much appreciated. I'm using python and xlrd to read the table.
t = [
    ["a1", "a2"],
    ["a1", "a3"],
    ["a1", "ax"],
    ["ax", "b1"],
    ["ax", "bx"],
    ["bx", "b3"]
]

import itertools

def find_matches(t, key):
    return list(itertools.chain(*[[v] if not v.endswith("x") else find_matches(t, v)
                                  for k, v in t if k == key]))

print(find_matches(t, "a1"))
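One caveat: the recursion never terminates if the cross references happen to form a cycle (say ax -> bx -> ax). An iterative variant with a visited set avoids that; find_matches_iter is a hypothetical name, not from the original post:

def find_matches_iter(t, key):
    # Walk the references with an explicit stack and a visited set
    # so that cyclic cross references cannot loop forever.
    out, stack, seen = [], [key], {key}
    while stack:
        k = stack.pop()
        for ident, ref in t:
            if ident != k:
                continue
            if ref.endswith("x"):
                if ref not in seen:
                    seen.add(ref)
                    stack.append(ref)
            else:
                out.append(ref)
    return out

print(find_matches_iter(t, "a1"))  # ['a2', 'a3', 'b1', 'b3']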
You could treat your list as the edge list of a graph. Something like:
t = [
    ["a1", "a2"],
    ["a1", "a3"],
    ["a1", "ax"],
    ["ax", "b1"],
    ["ax", "bx"],
    ["bx", "b3"]
]

class MyGraph:
    def __init__(self, adjacency_table):
        self.table = adjacency_table
        self.graph = {}
        for from_node, to_node in adjacency_table:
            if from_node in self.graph:
                self.graph[from_node].append(to_node)
            else:
                self.graph[from_node] = [to_node]
        print(self.graph)

    def find_leaves(self, v):
        seen = {v}  # note: set(v) would split the string into characters
        def search(v):
            for vertex in self.graph[v]:
                if vertex in seen:
                    continue
                seen.add(vertex)
                if vertex in self.graph:
                    for p in search(vertex):
                        yield p
                else:
                    yield vertex
        for p in search(v):
            yield p

print(list(MyGraph(t).find_leaves("a1")))
I'm working on some code for a directed graph in NetworkX, and have hit a block that's likely the result of my questionable programming experience. What I'm trying to do is the following:
I have a directed graph G with two "parent nodes" at the top, from which all other nodes flow. When graphing this network, I'd like to color every node that is a descendant of "Parent 1" one color, and all the other nodes another color, which means I need a list of Parent 1's successors.
Right now, I can get the first layer of them easily using:
descend= G.successors(parent1)
The problem is this only gives me the first generation of successors. Preferably, I want the successors of successors, the successors of the successors of the successors, etc. Arbitrarily, because it would be extremely useful to be able to run the analysis and make the graph without having to know exactly how many generations are in it.
Any idea how to approach this?
You don't need a list of descendants, you just want to color them. For that you just have to pick an algorithm that traverses the graph and use it to color the edges.
For example, you can do
from networkx.algorithms.traversal.depth_first_search import dfs_edges

G = DiGraph( ... )
for edge in dfs_edges(G, parent1):
    color(edge)
See https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.traversal.depth_first_search.dfs_edges.html?highlight=traversal
If you want to get all the successor nodes, without passing through edges, another way could be:
import networkx as nx
G = DiGraph( ... )
successors = nx.nodes(nx.dfs_tree(G, your_node))
I noticed that if you call instead:
successors = list(nx.dfs_successors(G, your_node))
the nodes of the bottom level are not included: dfs_successors returns a dict keyed by non-leaf nodes, so leaves appear only among the values.
Well, the successors of successors are just the successors of the descendants, right?
# First successors
descend = G.successors(parent1)

# 2nd-level successors
def allDescendants(d1):
    d2 = []
    for d in d1:
        d2 += G.successors(d)
    return d2

descend2 = allDescendants(descend)
To get level 3 descendants, call allDescendants(d2) etc.
Edit:
Issue 1: allDescend = descend + descend2 gives you the two sets combined; do the same for further levels of descendants.
Issue 2: If you have loops in your graph, then you need to first modify the code to test whether you've visited that descendant before, e.g.:

def allDescendants(d1, exclude):
    d2 = []
    for d in d1:
        d2 += filter(lambda s: s not in exclude, G.successors(d))
    return d2

This way, you pass allDescend as the second argument to the above function so it's not included in future descendants. You keep doing this until allDescendants() returns an empty array, in which case you know you've explored the entire graph, and you stop.
Since this is starting to look like homework, I'll let you figure out how to piece all this together on your own. ;)
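For future readers, the pieces fit together along these lines (a sketch under the same assumptions as the code above, with G and parent1 as before):

allDescend = []
frontier = list(G.successors(parent1))
while frontier:
    allDescend += frontier
    frontier = allDescendants(frontier, allDescend)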
So that the answer is somewhat cleaner and easier to find for future folks who stumble upon it, here's the code I ended up using:
import sys
import matplotlib.pyplot as plt
from networkx import (DiGraph, dfs_successors, draw_networkx_nodes,
                      draw_networkx_edges, draw_networkx_labels)
from networkx.drawing.nx_agraph import graphviz_layout

G = DiGraph()  # Creates an empty directed graph G
infile = open(sys.argv[1])
for edge in infile:
    edge1, edge2 = edge.split()  # Splits data on the space
    node1 = int(edge1)           # Creates integer version of the node names
    node2 = int(edge2)
    G.add_edge(node1, node2)     # Adds an edge between two nodes

parent1 = int(sys.argv[2])
parent2 = int(sys.argv[3])

data_successors = dfs_successors(G, parent1)
successor_list = data_successors.values()
allsuccessors = [item for sublist in successor_list for item in sublist]

pos = graphviz_layout(G, prog='dot')
plt.figure(dpi=300)
draw_networkx_nodes(G, pos, node_color="LightCoral")
draw_networkx_nodes(G, pos, nodelist=allsuccessors, node_color="SkyBlue")
draw_networkx_edges(G, pos, arrows=False)
# The original passed labels=labels, but labels was never defined;
# the default draws each node's name.
draw_networkx_labels(G, pos, font_size=6, font_family='sans-serif')
I believe NetworkX has changed since @Jochen Ritzel's answer a few years ago.
Now the following holds, only changing the import statement.
import networkx
from networkx import dfs_edges

G = DiGraph( ... )
for edge in dfs_edges(G, parent1):
    color(edge)
One-liner:
descendents = sum(nx.dfs_successors(G, parent).values(), [])
nx.descendants(G, parent)
more details: https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.dag.descendants.html