I have a (bipartite) directed graph where a legal entity is connected by an edge to each candidate it sponsored or cosponsored. From it, I want a second (unipartite), undirected one, G, projected from the first in which nodes are candidates and the weighted edges connecting them indicate how many times they received money together from the same legal entity.
All the information is encoded in a dataframe candidate_donator where each candidate is associated with a tuple containing the donators who gave to them.
I'm using NetworkX to create the network and want to optimize my implementation because it is taking very long. My original approach is:
candidate_donator = df.groupby('candidate').agg({'donator': lambda x: tuple(set(x))})
import itertools
candidate_pairs = list(itertools.combinations(candidate_donator.index, 2)) #creating all possible unique combinations of candidate pairs ~83 M

for cpf1, cpf2 in candidate_pairs:
    donators_filter = list(filter(set(candidate_donator.loc[cpf1, 'donator']).__contains__,
                                  candidate_donator.loc[cpf2, 'donator']))
    G.add_edge(cpf1, cpf2, weight=len(donators_filter))
Try this:
#list of donators per candidate
candidate_donator = df.groupby('candidate').agg({'donator': lambda x: tuple(set(x))})
#list of candidates per donator
donator_candidate = df.groupby('donator').agg({'candidate': lambda x: tuple(set(x))})
import networkx as nx
G = nx.Graph() #undirected projected graph

#for each candidate
for candidate_idx in candidate_donator.index:
    #for each donator connected to this candidate
    for donator in candidate_donator.loc[candidate_idx, 'donator']:
        #for each other candidate who received money from the same donator
        for last_candidate in donator_candidate.loc[donator, 'candidate']:
            if last_candidate == candidate_idx:
                continue #skip self-loops
            #existing edge, add weight
            if G.has_edge(candidate_idx, last_candidate):
                G[candidate_idx][last_candidate]['weight'] += 0.5
            #non existing edge, weight = 0.5 (every edge will be added twice)
            else:
                G.add_edge(candidate_idx, last_candidate, weight=0.5)
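For what it's worth, NetworkX's bipartite module can also build this weighted projection directly. The sketch below is only an illustration (it assumes the dataframe columns 'candidate' and 'donator' from the question and that donator and candidate identifiers never collide), not a drop-in replacement for the loop above:

from networkx.algorithms import bipartite

B = nx.Graph()
B.add_nodes_from(df['donator'].unique(), bipartite=0)
B.add_nodes_from(df['candidate'].unique(), bipartite=1)
B.add_edges_from(df[['donator', 'candidate']].itertuples(index=False, name=None))

#weight on each candidate pair = number of shared donators
G_proj = bipartite.weighted_projected_graph(B, list(df['candidate'].unique()))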
I have an nx.DiGraph object with 1,000,000 nodes and 1,500,000 edges. The graph is bipartite and its edges go A->T->A, where the A nodes are numbers greater than 99602440 and the T nodes are smaller numbers. There are approximately as many T nodes as A nodes.
I want to generate another graph with the same nodes but random edges. The edges should be generated by sampling from the in- and out-degree distributions of the original graph.
This is the code I produced:
import networkx as nx
import numpy as np
from numpy import random  #assumed import: random.choice(seq, n) below uses the NumPy signature

G = nx.read_weighted_edgelist("C:/Users/uccio/Desktop/tesi2/mygraphUpTo_2011_5_13",
                              create_using=nx.DiGraph, nodetype=int)
#my original graph

def cumul(sort): #create cumulative frequencies
    ind = []
    val = []
    c = 0
    for i, j in sort:
        ind.append(i)
        c += j
        val.append(c)
    return ind, val

def exCum(cum_index, cum_value, sumtot): #binary search in the cumulative
    coin = int(random.randint(0, sumtot)) #extract random value
    val = np.searchsorted(cum_value, coin, side="left")
    return cum_index[val] #return key of the dictionary of cumulated frequencies

distin = {}  #dictionary of in-degree frequencies
distout = {} #dictionary of out-degree frequencies
for f in G.nodes():
    if f < 99602440:
        x = G.in_degree(f)
        y = G.out_degree(f)
        if x in distin:
            distin[x] += 1
        else:
            distin[x] = 1
        if y in distout:
            distout[y] += 1
        else:
            distout[y] = 1

sort_in = sorted(distin.items()) #ordered to do binary search in the dict
sort_out = sorted(distout.items())
cumin_index, cumin_val = cumul(sort_in) #cumulative values to do binary search in
cumout_index, cumout_val = cumul(sort_out)
sumtot_in = sum(distin.values())
sumtot_out = sum(distout.values())

test = nx.DiGraph()
test.add_nodes_from(G.nodes()) #my new graph

trans = [] #extracted T from the graph
add = []   #extracted A from the graph
for i in test.nodes():
    if i < 99602440:
        trans.append(i)
    else:
        add.append(i)

for t in trans:
    ind = exCum(cumin_index, cumin_val, sumtot_in) #binary search returns the key of the dictionary
    outd = exCum(cumout_index, cumout_val, sumtot_out)
    extin = list(random.choice(add, ind))   #choose random nodes from A to use as input
                                            #ind decides how many A->T edges should exist
    extout = list(random.choice(add, outd)) #choose random nodes from A to use as output
                                            #outd decides how many T->A edges should exist
    for inv in extin:
        test.add_weighted_edges_from([(inv, t, 0)]) #creates the edges A->T for that T
    for outv in extout:
        test.add_weighted_edges_from([(t, outv, 0)]) #creates the edges T->A for that T
The code works, but the last part (starting from for t in trans:) takes a long time: at least 5 hours, and since I went to bed while the computer was working it could have been more.
Is there a way to make everything faster?
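One direction that may help, sketched under the assumption that the per-edge add_weighted_edges_from calls account for a large part of the cost (the loop structure is otherwise the same as in the question): collect all edges in a plain Python list first and add them to the graph in a single bulk call.

new_edges = [] #all (source, target, weight) triples, added to the graph once at the end
for t in trans:
    ind = exCum(cumin_index, cumin_val, sumtot_in)
    outd = exCum(cumout_index, cumout_val, sumtot_out)
    extin = random.choice(add, ind)   #NumPy choice, as in the original snippet
    extout = random.choice(add, outd)
    new_edges.extend((inv, t, 0) for inv in extin)    #A->T edges for this T
    new_edges.extend((t, outv, 0) for outv in extout) #T->A edges for this T
test.add_weighted_edges_from(new_edges) #one bulk insertion instead of one call per edge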
Let's say I have an unweighted directed graph. I was wondering if there is a way to store all the distances between a starting node and all the remaining nodes of the graph. I know Dijkstra's algorithm could be an option, but I'm not sure it would be the best one, since I'm working with a pretty big graph (~100k nodes) and it is unweighted. My thoughts so far were to perform a BFS and store the distances along the way. Is this a feasible approach?
Finally, since I'm pretty new to graph theory, could someone maybe point me in the right direction for a good Python implementation of this kind of problem?
Definitely feasible, and pretty fast if your data structure contains a list of end nodes for each starting node indexed on the starting node identifier:
Here's an example using a dictionary for edges: {startNode:list of end nodes}
from collections import deque

maxDistance = 0
def getDistances(origin, edges):
    global maxDistance
    maxDistance = 0
    distances = {origin: 0}                # {endNode: distance from origin}
    toLink = deque([origin])               # start at origin (distance=0)
    while toLink:
        start = toLink.popleft()           # previous end, will chain to next
        dist = distances[start] + 1        # new next are at +1
        for end in edges[start]:           # next end nodes
            if end in distances: continue  # new ones only
            distances[end] = dist          # record distance
            toLink.append(end)             # will link from there
            maxDistance = max(maxDistance, dist)
    return distances
This does one iteration per node (excluding unreachable nodes) and uses fast dictionary accesses to follow the links to newly reached nodes.
Using some random test data (10 million edges) ...
import random
from collections import defaultdict
print("loading simulated graphs")
vertexCount = 100000
edgeCount = vertexCount * 100
edges = defaultdict(set)
edgesLoaded = 0
minSpan = 1 # vertexCount//2
while edgesLoaded < edgeCount:
    start = random.randrange(vertexCount)
    end = random.randrange(vertexCount)
    if abs(start-end) > minSpan and end not in edges[start]:
        edges[start].add(end)
        edgesLoaded += 1
print("loaded!")
Performance:
# starting from a randomly selected node
origin = random.choice(list(edges.keys()))
from timeit import timeit
t = timeit(lambda:getDistances(origin,edges),number=1)
print(f"{t:.2f} seconds for",edgeCount,"edges", "max distance = ",maxDistance)
# 3.06 seconds for 10000000 edges max distance = 4
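For reference, networkx also ships a built-in BFS helper that produces the same origin-to-node distance dictionary; a small sketch assuming the edges dict from the test data above:

import networkx as nx

G = nx.DiGraph()
G.add_edges_from((u, v) for u, ends in edges.items() for v in ends)
distances = nx.single_source_shortest_path_length(G, origin)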
I am using networkx to create an algorithm to calculate the modularity of the different communities. Now I am getting a key error when doing G[complst[i]][complst[j]]['weight'], even though I printed out complst[i] and complst[j] and their values look correct. Can anyone help? I tried many ways to debug it, such as saving them in separate variables, but they didn't help.
import networkx as nx
import copy

#load the graph made in previous task
G = nx.read_gexf("graph.gexf")
#set a global max modularity value
maxmod = 0
#deep copy of the original graph, since removing edges will change the graph
ori = copy.deepcopy(G)
#create an array for saving the edges to remove
arr = []
#see if all edges are broken; if not, keep looping, otherwise stop
while(G.number_of_edges() != 0):
    #find the edge betweenness for each edge
    betweeness = nx.edge_betweenness_centrality(G, weight='weight', normalized=False)
    print('------------------******************--------------------')
    #sort the result in descending order and save all edges with the maximum betweenness to 'arr'
    sortbet = {k: v for k, v in sorted(betweeness.items(), key=lambda item: item[1], reverse=True)}
    #convert the dict to a list for processing
    betlst = list(sortbet)
    for i in range(len(betlst)):
        if betlst[i] == betlst[0]:
            arr.append(betlst[i])
    #remove all edges with maximum betweenness from the graph
    G.remove_edges_from(arr)
    #find the leftover components, and convert the result to a list for further modularity processing
    lst = list(nx.connected_components(G))
    #!!!!!!!!testing and debugging the value, now the value is printed correctly
    print(G['pk_sullivan']['ChrisWarcraft']['weight'])
    #create a variable cnt to represent the modularity of this split
    cnt = 0
    #iterate over lst; each component is saved as a python set
    for n in range(len(lst)):
        #convert each component from set to list for processing
        complst = list(lst[n])
        #if this component is a singleton, its modularity is 0, so add 0 to the current cnt
        if len(complst) == 1:
            cnt += 0
        else:
            #calculate the modularity for this component by using combinations of edges
            for i in range(0, len(complst)):
                if i+1 <= len(complst)-1:
                    for j in range(i+1, len(complst)):
                        #!!!!!!!!! a bunch of testing: the values all print fine until "print(G[a][b]['weight'])"
                        print(i)
                        print(j)
                        print(complst)
                        a = complst[i]
                        print(type(a))
                        b = complst[j]
                        print(type(b))
                        print(G[a][b]['weight'])
                        #calculate the modularity using the equation M = 1/2m*(weight(a,b)-degree(a)*degree(b)/2m)
                        cnt += 1/(2*ori.number_of_edges())*(G[a][b]['weight']-ori.degree(a)*ori.degree(b)/(2*ori.number_of_edges()))
    #find the maximum modularity and save this split of the graph, end!
    if cnt >= maxmod:
        maxmod = cnt
        newgraph = copy.deepcopy(G)
        print('maxmod is', maxmod)
Feel free to run the code to see the error; I hope my code comments help!
It looks like you're trying to find the weight of all combinations of nodes within each connected component. The problem is that you're assuming all nodes in a connected component are directly connected, i.e. joined by a single edge, which is wrong.
In your code, you have:
...
for i in range(0,len(complst)):
    if i+1 <= len(complst)-1:
        for j in range(i+1,len(complst)):
...
And then you try to find the weight of the edge that connects these two nodes. But not every pair of nodes in a connected component is joined by an edge. A connected component just means that all nodes are reachable from all others.
So you should be iterating over the edges in the subgraph generated by the connected component, or something along these lines.
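As a rough illustration of that suggestion (a sketch, not a drop-in fix for the modularity loop above), you can induce the subgraph of each connected component and walk only the edges it actually contains, which avoids the key error:

for component in nx.connected_components(G):
    sub = G.subgraph(component)             #subgraph induced by this component
    for a, b, data in sub.edges(data=True): #only pairs that really share an edge
        print(a, b, data['weight'])         #safe: the edge (a, b) exists in G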
I am reading graphs such as http://www.dis.uniroma1.it/challenge9/data/rome/rome99.gr from http://www.dis.uniroma1.it/challenge9/download.shtml in Python, for example using this code:
#!/usr/bin/python
from igraph import *
fname = "rome99.gr"
g = Graph.Read_DIMACS(fname, directed=True )
(I need to change the line "p sp 3353 8870" to "p max 3353 8870" to get this to work using igraph.)
I would like to convert the graph to one where all nodes have out-degree 1 (not counting the extra zero-weight edges we are allowed to add) but still preserve all shortest paths. That is, a path between two nodes should be a shortest path in the original graph if and only if it is a shortest path in the converted graph. I will explain this a little more after an example.
One way to do this I was thinking is to replace each node v by a little linear subgraph with v.outdegree(mode=OUT) nodes. In the subgraph the nodes are connected in sequence by zero weight edges. We then connect nodes in the subgraph to the first node in other little subgraphs we have created.
I don't mind using igraph or networkx for this task but I am stuck with the syntax of how to do it.
For example, if we start with graph G:
I would like to convert it to graph H:
As the second graph has more nodes than the first, we need to define what we mean by its having the same shortest paths as the first graph. I only consider paths between nodes labelled with simple letters or nodes labelled X1. In other words, in this example a path can't start or end in A2 or B2. We also merge all versions of a node when considering a path, so a path A1->A2->D in H is regarded as the same as A->D in G.
This is how far I have got. First I add the zero weight edges to the new graph
h = Graph(g.ecount(), directed=True)
#Connect the nodes with zero weight edges
gtoh = [0]*g.vcount()
i = 0
for v in g.vs:
    gtoh[v.index] = i
    if (v.degree(mode=OUT) > 1):
        for j in xrange(v.degree(mode=OUT)-1):
            h.add_edge(i, i+1, weight=0)
            i = i+1
    i = i + 1
Then I add the main edges
#Now connect the nodes to the relevant "head" nodes.
for v in g.vs:
    h_v_index = gtoh[v.index]
    i = 0
    for neighbour in g.neighbors(v, mode=OUT):
        h.add_edge(gtoh[v.index]+i, gtoh[neighbour], weight=g.es[g.get_eid(v.index, neighbour)]["weight"])
        i = i + 1
Is there a nicer/better way of doing this? I feel there must be.
The following code should work in igraph and Python 2.x; basically it does what you proposed: it creates a "linear subgraph" for every single node in the graph, and connects exactly one outgoing edge to each node in the linear subgraph corresponding to the old node.
#!/usr/bin/env python

from igraph import Graph
from itertools import izip

def pairs(l):
    """Given a list l, returns an iterable that yields pairs of the form
    (l[i], l[i+1]) for all possible consecutive pairs of items in l"""
    return izip(l, l[1:])

def convert(g):
    # Get the old vertex names from g
    if "name" in g.vertex_attributes():
        old_names = map(str, g.vs["name"])
    else:
        old_names = map(str, xrange(g.vcount()))

    # Get the outdegree vector of the old graph
    outdegs = g.outdegree()

    # Create a mapping from old node IDs to the ID of the first node in
    # the linear subgraph corresponding to the old node in the new graph
    new_node_id = 0
    old_to_new = []
    new_names = []
    for old_node_id in xrange(g.vcount()):
        old_to_new.append(new_node_id)
        new_node_id += outdegs[old_node_id]
        old_name = old_names[old_node_id]
        if outdegs[old_node_id] <= 1:
            new_names.append(old_name)
        else:
            for i in xrange(1, outdegs[old_node_id]+1):
                new_names.append(old_name + "." + str(i))

    # Add a sentinel element to old_to_new just to make our job easier
    old_to_new.append(new_node_id)

    # Create the edge list of the new graph and the weights of the new
    # edges
    new_edgelist = []
    new_weights = []

    # 1) Create the linear subgraphs
    for new_node_id, next_new_node_id in pairs(old_to_new):
        for source, target in pairs(range(new_node_id, next_new_node_id)):
            new_edgelist.append((source, target))
            new_weights.append(0)

    # 2) Create the new edges based on the old ones
    for old_node_id in xrange(g.vcount()):
        new_node_id = old_to_new[old_node_id]
        for edge_id in g.incident(old_node_id, mode="out"):
            neighbor = g.es[edge_id].target
            new_edgelist.append((new_node_id, old_to_new[neighbor]))
            new_node_id += 1
            print g.es[edge_id].source, g.es[edge_id].target, g.es[edge_id]["weight"]
            new_weights.append(g.es[edge_id]["weight"])

    # Return the graph
    vertex_attrs = {"name": new_names}
    edge_attrs = {"weight": new_weights}
    return Graph(new_edgelist, directed=True, vertex_attrs=vertex_attrs,
                 edge_attrs=edge_attrs)
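A possible usage sketch (not part of the original answer): convert() expects the edge weights in a "weight" attribute, so depending on how the DIMACS file was read you may need to copy them there first; the "capacity" attribute name below is an assumption about what Read_DIMACS stores.

g = Graph.Read_DIMACS("rome99.gr", directed=True)
if "weight" not in g.edge_attributes() and "capacity" in g.edge_attributes():
    g.es["weight"] = g.es["capacity"]  #assumption: DIMACS values were read as capacities
h = convert(g)
print h.summary()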
I would like to identify the Kth order neighbors of an edge on a network, specifically the neighbors of a large set of streets. For example, I have a street that I'm interested in looking at, call this the focal street. For each focal street I want to find the streets that share an intersection, these are the first order neighbors. Then for each of those streets that share an intersection with the focal street I would like to find their neighbors (these would be the second order neighbors), and so on...
Calculating the first order neighbors using ArcGIS' geoprocessing library (arcpy) took 6+ hours; second order neighbors are taking 18+ hours. Needless to say, I want to find a more efficient solution. I have created a python dictionary which is keyed on each street and contains the connected streets as values. For example:
st2neighs = {street1: [street2, street3, street5], street2: [street1, street4], ...}.
Street 1 is connected to streets 2, 3, and 5; street 2 is connected to streets 1 and 4; etc. There are around 30,000 streets in the study area, and most have fewer than 7 connected streets. A pickled version of the data used in the code below IS HERE.
I assumed that knowing the first order neighbors would allow me to efficiently trace the higher order neighbors. But the following code is providing incorrect results:
##Select K-order neighbors from a set of sampled streets.
##saves in dictionary format such that
##the key is the sampled street and the neighboring streets are the values

##################
##IMPORT LIBRARIES
##################
import random as random
import pickle

#######################
##LOAD PICKLED DATA
#######################
seg_file = open("seg2st.pkl", "rb")
st_file = open("st2neighs.pkl", "rb")

seg2st = pickle.load(seg_file)
st2neigh = pickle.load(st_file)

##################
##DEF FUNCTIONS
##################

##Takes in a dict of segments (key) and their streets (values).
##returns the desired number of sampled streets per segment
##returns a dict keyed segment containing tlids.
def selectSample(seg2st, nbirths):
    randSt = {}
    for segK in seg2st.iterkeys():
        ranSamp = [int(random.choice(seg2st[segK])) for i in xrange(nbirths)]
        randSt[segK] = []
        for aSamp in ranSamp:
            randSt[segK].append(aSamp)
    return randSt

##Takes in a list of all streets (keys) and their first order neighbors (values)
##Takes in a list of sampled streets
##returns a dict of all sampled streets and their neighbors.
##Higher order selections should be possible with findMoreNeighbors
##logic is the same but replacing sample (input) with output from
##findFirstNeighbors
def findFirstNeighbors(st2neigh, sample):
    compSts = {}
    for samp in sample.iterkeys():
        for rSt in sample[samp]:
            if rSt not in compSts:
                compSts[rSt] = []
            for compSt in st2neigh[rSt]:
                compSts[rSt].append(compSt)
    return compSts

def findMoreNeighbors(st2neigh, compSts):
    for aSt in compSts:
        for st in compSts[aSt]:
            for nSt in st2neigh[st]:
                if nSt not in compSts[aSt]:
                    compSts[aSt].append(nSt)
    moreNeighs = compSts
    return moreNeighs

#####################
##The nHoods
#####################
samp = selectSample(seg2st, 1)
n1 = findFirstNeighbors(st2neigh, samp)
n2 = findMoreNeighbors(st2neigh, n1)
n3 = findMoreNeighbors(st2neigh, n2)

#####################
##CHECK RESULTS
#####################
def checkResults(neighList):
    cntr = {}
    for c in neighList.iterkeys():
        cntr[c] = 0
        for a in neighList[c]:
            cntr[c] += 1
    return cntr

##There is an error no streets **should** have 2000+ order neighbors
c1 = checkResults(n1)
c2 = checkResults(n2)
c3 = checkResults(n3)
Help!
It seems to me like what you want to implement is the following: http://en.wikipedia.org/wiki/Composition_of_relations
It is actually a straightforward algorithm. Let R be the relation "is a first order neighbor", so if two streets x, y are in R then x is a first order neighbor of y. For second order neighbors you want to compute R composed with R, for third order neighbors (R composed R) composed R, and so on.
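For illustration only, here is a minimal sketch of that composition over the first-order dictionary from the question (the compose helper name is made up; st2neigh is the {street: [neighbors]} mapping loaded from the pickle):

def compose(rel1, rel2):
    #for each street, collect the streets reachable by following rel1 and then rel2
    out = {}
    for street, neighbors in rel1.items():
        reached = set()
        for n in neighbors:
            reached.update(rel2.get(n, []))
        reached.discard(street) #optionally drop the street itself
        out[street] = sorted(reached)
    return out

second_order = compose(st2neigh, st2neigh)    #R composed with R
third_order = compose(second_order, st2neigh) #(R composed R) composed R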