Efficient (spatial) Network Neighbors? - python

I would like to identify the Kth order neighbors of an edge on a network, specifically the neighbors of a large set of streets. For example, I have a street that I'm interested in looking at, call this the focal street. For each focal street I want to find the streets that share an intersection, these are the first order neighbors. Then for each of those streets that share an intersection with the focal street I would like to find their neighbors (these would be the second order neighbors), and so on...
Calculating the first order neighbors using ArcGIS' geoprocessing library (arcpy) took 6+ hours; second order neighbors are taking 18+ hours. Needless to say, I want to find a more efficient solution. I have created a python dictionary keyed on each street, containing the connected streets as values. For example:
st2neighs = {street1: [street2, street3, street5], street2: [street1, street4], ...}.
Street 1 is connected to streets 2, 3, and 5; street 2 is connected to streets 1 and 4; etc. There are around 30,000 streets in the study area; most have fewer than 7 connected streets. A pickled version of the data used in the code below is here.
I assumed that knowing the first order neighbors would allow me to efficiently trace the higher order neighbors. But the following code is providing incorrect results:
##Select K-order neighbors from a set of sampled streets.
##saves in dictionary format such that
##the key is the sampled street and the neighboring streets are the values

##################
##IMPORT LIBRARIES
##################
import random as random
import pickle

#######################
##LOAD PICKLED DATA
#######################
seg_file = open("seg2st.pkl", "rb")
st_file = open("st2neighs.pkl", "rb")
seg2st = pickle.load(seg_file)
st2neigh = pickle.load(st_file)

##################
##DEF FUNCTIONS
##################

##Takes in a dict of segments (key) and their streets (values).
##returns the desired number of sampled streets per segment
##returns a dict keyed segment containing tlids.
def selectSample(seg2st, nbirths):
    randSt = {}
    for segK in seg2st.iterkeys():
        ranSamp = [int(random.choice(seg2st[segK])) for i in xrange(nbirths)]
        randSt[segK] = []
        for aSamp in ranSamp:
            randSt[segK].append(aSamp)
    return randSt

##Takes in a list of all streets (keys) and their first order neighbors (values)
##Takes in a list of sampled streets
##returns a dict of all sampled streets and their neighbors.
##Higher order selections should be possible with findMoreNeighbors
##logic is the same but replacing sample (input) with output from
##findFirstNeighbors
def findFirstNeighbors(st2neigh, sample):
    compSts = {}
    for samp in sample.iterkeys():
        for rSt in sample[samp]:
            if rSt not in compSts:
                compSts[rSt] = []
            for compSt in st2neigh[rSt]:
                compSts[rSt].append(compSt)
    return compSts

def findMoreNeighbors(st2neigh, compSts):
    for aSt in compSts:
        for st in compSts[aSt]:
            for nSt in st2neigh[st]:
                if nSt not in compSts[aSt]:
                    compSts[aSt].append(nSt)
    moreNeighs = compSts
    return moreNeighs

#####################
##The nHoods
#####################
samp = selectSample(seg2st, 1)
n1 = findFirstNeighbors(st2neigh, samp)
n2 = findMoreNeighbors(st2neigh, n1)
n3 = findMoreNeighbors(st2neigh, n2)

#####################
##CHECK RESULTS
#####################
def checkResults(neighList):
    cntr = {}
    for c in neighList.iterkeys():
        cntr[c] = 0
        for a in neighList[c]:
            cntr[c] += 1
    return cntr

##There is an error no streets **should** have 2000+ order neighbors
c1 = checkResults(n1)
c2 = checkResults(n2)
c3 = checkResults(n3)
Help!

It seems to me like what you want to implement is the following: http://en.wikipedia.org/wiki/Composition_of_relations
It is actually a straightforward algorithm. Let R be the relation "is a first order neighbor", so if two streets x, y are in R then x is a first order neighbor of y. For second order neighbors you want to compute R composed with R; for third order neighbors, (R composed R) composed R; and so on.
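In practice you can compute the composed relation with a plain breadth-first expansion over the st2neighs dictionary. A minimal sketch (my function name, not from the post):

def kth_order_neighbors(st2neigh, focal, k):
    seen = {focal}
    frontier = {focal}
    for _ in range(k):
        # expand one ring: neighbors of the current frontier not yet seen
        frontier = {n for st in frontier for n in st2neigh.get(st, [])} - seen
        seen |= frontier
    return seen - {focal}  # every street within k hops of the focal street

Because seen blocks revisits and the frontier is rebuilt rather than mutated in place, this avoids what findMoreNeighbors appears to do: appending to compSts[aSt] while iterating over it, which expands the newly added streets too and transitively closes whole components; that is the likely source of the 2000+ neighbor counts.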

Related

Generate random edge in a networkx DiGraph

I have an nx.DiGraph object with 1,000,000 nodes and 1,500,000 edges. The graph is bipartite and shaped like A->T->A, where A nodes are numbers bigger than 99602440 and T nodes are smaller numbers. There are approximately the same number of T and A nodes.
I want to generate another graph with the same number of nodes but random edges. The edges should be generated by sampling from the in- and out-degree distributions of the original graph.
This is the code I produced:
import networkx as nx
import numpy as np
from numpy import random  # numpy's random: choice() below takes a size argument

G = nx.read_weighted_edgelist("C:/Users/uccio/Desktop/tesi2/mygraphUpTo_2011_5_13",
                              create_using=nx.DiGraph, nodetype=int)
#my original graph

def cumul(sort):  #create cumulative frequencies
    ind = []
    val = []
    c = 0
    for i, j in sort:
        ind.append(i)
        c += j
        val.append(c)
    return ind, val

def exCum(cum_index, cum_value, sumtot):  #binary search in the cumulative
    coin = int(random.randint(0, sumtot))  #extract random value
    val = np.searchsorted(cum_value, coin, side="left")
    return cum_index[val]  #return key of the dictionary of cumulated frequencies

distin = {}   #dictionary of in-degree frequencies
distout = {}  #dictionary of out-degree frequencies
for f in G.nodes():
    if f < 99602440:
        x = G.in_degree(f)
        y = G.out_degree(f)
        if x in distin:
            distin[x] += 1
        else:
            distin[x] = 1
        if y in distout:
            distout[y] += 1
        else:
            distout[y] = 1

sort_in = sorted(distin.items())  #ordered to do binary search in the dict
sort_out = sorted(distout.items())
cumin_index, cumin_val = cumul(sort_in)  #cumulative values to do binary search in
cumout_index, cumout_val = cumul(sort_out)
sumtot_in = sum(distin.values())
sumtot_out = sum(distout.values())

test = nx.DiGraph()
test.add_nodes_from(G.nodes())  #my new graph
trans = []  #extracted T from the graph
add = []    #extracted A from the graph
for i in test.nodes():
    if i < 99602440:
        trans.append(i)
    else:
        add.append(i)

for t in trans:
    ind = exCum(cumin_index, cumin_val, sumtot_in)  #binary search returns the key of the dictionary
    outd = exCum(cumout_index, cumout_val, sumtot_out)
    extin = list(random.choice(add, ind))    #choose random nodes from A to use as input;
                                             #ind chooses how many A->T edges should exist
    extout = list(random.choice(add, outd))  #choose random nodes from A to use as output;
                                             #outd chooses how many T->A edges should exist
    for inv in extin:
        test.add_weighted_edges_from([(inv, t, 0)])  #creates the edges A->T for that T
    for outv in extout:
        test.add_weighted_edges_from([(t, outv, 0)])  #creates the edges T->A for that T
The code works, but the last part (from for t in trans:) takes a long time (surely over 5 hours, but I went to bed while the computer was working, so it could be more).
Is there a way to make everything faster?
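One direction worth trying (a sketch against the names defined above, not tested on your data): draw all the degrees up front with numpy's vectorized sampler, collect the edges into one list, and call add_weighted_edges_from once instead of once per edge:

import numpy as np

rng = np.random.default_rng()
# degree values and their relative frequencies, straight from distin/distout
in_deg = np.array(list(distin.keys()))
in_p = np.array(list(distin.values()), dtype=float) / sum(distin.values())
out_deg = np.array(list(distout.keys()))
out_p = np.array(list(distout.values()), dtype=float) / sum(distout.values())

sampled_in = rng.choice(in_deg, size=len(trans), p=in_p)   # one draw per T node
sampled_out = rng.choice(out_deg, size=len(trans), p=out_p)

add_arr = np.array(add)  # convert the list of A nodes to an array ONCE
edges = []
for t, k_in, k_out in zip(trans, sampled_in, sampled_out):
    for a in rng.choice(add_arr, size=k_in):
        edges.append((int(a), t, 0))   # A -> T
    for a in rng.choice(add_arr, size=k_out):
        edges.append((t, int(a), 0))   # T -> A
test.add_weighted_edges_from(edges)    # one bulk insert instead of ~1.5M singles

Two details matter here: numpy's choice converts a Python list to an array on every call, so building add_arr once removes a large per-iteration cost, and the single bulk add_weighted_edges_from avoids repeated per-call overhead inside the loop.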

getting key error for accessing the graph's weights, networkx

I am using networkx to create an algorithm to calculate the modularity of different communities. I am getting a KeyError on G[complst[i]][complst[j]]['weight'], even though I printed out complst[i] and complst[j] and found these values are correct. Can anyone help? I tried many ways to debug it, such as saving them in separate variables, but they don't help.
import networkx as nx
import copy

#load the graph made in previous task
G = nx.read_gexf("graph.gexf")
#set a global max modularity value
maxmod = 0
#deep copy of the original graph, since when removing edges, the graph will change
ori = copy.deepcopy(G)
#create an array for saving the edges to remove
arr = []
#see if all edges are broken; if not, keep looping, otherwise stop
while G.number_of_edges() != 0:
    #find the edge betweenness for each edge
    betweeness = nx.edge_betweenness_centrality(G, weight='weight', normalized=False)
    print('------------------******************--------------------')
    #sort the result in descending order and save all edges with the maximum betweenness to 'arr'
    sortbet = {k: v for k, v in sorted(betweeness.items(), key=lambda item: item[1], reverse=True)}
    #convert the dict to a list for processing
    betlst = list(sortbet)
    for i in range(len(betlst)):
        if betlst[i] == betlst[0]:
            arr.append(betlst[i])
    #remove all edges with maximum betweenness from the graph
    G.remove_edges_from(arr)
    #find the leftover components, converted to a list for further modularity processing
    lst = list(nx.connected_components(G))
    #!!!!!!!! testing and debugging: this value is printed correctly
    print(G['pk_sullivan']['ChrisWarcraft']['weight'])
    #create a variable cnt to represent the modularity of this graph
    cnt = 0
    #iterate over lst; each component is saved as a python set
    for n in range(len(lst)):
        #convert each component from set to list for processing
        complst = list(lst[n])
        #if this component is a singleton, its modularity is 0, so add 0 to the current cnt
        if len(complst) == 1:
            cnt += 0
        else:
            #calculate the modularity for this component by using combinations of nodes
            for i in range(0, len(complst)):
                if i+1 <= len(complst)-1:
                    for j in range(i+1, len(complst)):
                        #!!!!!!!!! the values below all print fine until print(G[a][b]['weight'])
                        print(i)
                        print(j)
                        print(complst)
                        a = complst[i]
                        print(type(a))
                        b = complst[j]
                        print(type(b))
                        print(G[a][b]['weight'])
                        #calculate the modularity using M = 1/2m*(weight(a,b)-degree(a)*degree(b)/2m)
                        cnt += 1/(2*ori.number_of_edges())*(G[a][b]['weight']-ori.degree(a)*ori.degree(b)/(2*ori.number_of_edges()))
    #keep the split with the maximum modularity
    if cnt >= maxmod:
        maxmod = cnt
        newgraph = copy.deepcopy(G)
        print('maxmod is', maxmod)
Here is the error; you're welcome to run the code, and I hope my code comments help!
It looks like you're trying to find the weight of every pair of nodes within each connected component. The problem is that you're assuming all nodes in a connected component are first-degree connected, i.e. connected through a single edge, which is wrong.
In your code, you have:
...
for i in range(0, len(complst)):
    if i+1 <= len(complst)-1:
        for j in range(i+1, len(complst)):
            ...
And then you try to find the weight of the edge connecting each pair. But not every pair of nodes in a connected component is joined by an edge; a connected component just means that all nodes are reachable from all others.
So you should be iterating over the edges of the subgraph induced by each connected component, or something along those lines.
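A minimal sketch of that suggestion, reusing the names G, ori and cnt from the question:

m = ori.number_of_edges()
for component in nx.connected_components(G):
    sub = G.subgraph(component)  # induced subgraph: only edges that actually exist
    for a, b, w in sub.edges(data='weight', default=0):
        cnt += 1/(2*m) * (w - ori.degree(a)*ori.degree(b)/(2*m))

If you do want the full modularity sum over all node pairs (the degree-product term also applies to non-adjacent pairs), keep the pairwise loop but read the weight with G.get_edge_data(a, b, default={}).get('weight', 0), which yields 0 for missing edges instead of raising the KeyError.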

Improve creation of undirected graph projected from a directed one using Python

I have a (bipartite) directed graph where a legal entity is connected by an edge to each candidate it sponsored or cosponsored. From it, I want a second (unipartite), undirected one, G, projected from the first in which nodes are candidates and the weighted edges connecting them indicate how many times they received money together from the same legal entity.
All the information is encoded in a dataframe candidate_donator, where each candidate is associated with a tuple containing who donated to them.
I'm using Networkx to create the network and want to optimize my implementation because it is taking very long. My original approach is:
import itertools
import networkx as nx

#list of donators per candidate
candidate_donator = df.groupby('candidate').agg({'donator': lambda x: tuple(set(x))})
G = nx.Graph()
#creating all possible unique combinations of candidate pairs: ~83M
candidate_pairs = list(itertools.combinations(candidate_donator.index, 2))
for cpf1, cpf2 in candidate_pairs:
    #donators that cpf1 and cpf2 have in common
    donators_filter = list(filter(set(candidate_donator.loc[cpf1, 'donator']).__contains__,
                                  candidate_donator.loc[cpf2, 'donator']))
    G.add_edge(cpf1, cpf2, weight=len(donators_filter))
Try this:
import networkx as nx

G = nx.Graph()
#list of donators per candidate
candidate_donator = df.groupby('candidate').agg({'donator': lambda x: tuple(set(x))})
#list of candidates per donator
donator_candidate = df.groupby('donator').agg({'candidate': lambda x: tuple(set(x))})
#for each candidate
for candidate_idx in candidate_donator.index:
    #for each donator connected to this candidate
    for donator in candidate_donator.loc[candidate_idx, 'donator']:
        #for each other candidate who shares that donator
        for last_candidate in donator_candidate.loc[donator, 'candidate']:
            if last_candidate == candidate_idx:
                continue  #skip self-loops
            #existing edge, add weight
            if G.has_edge(candidate_idx, last_candidate):
                G[candidate_idx][last_candidate]['weight'] += 0.5
            #non existing edge, weight = 0.5 (every edge will be added twice)
            else:
                G.add_edge(candidate_idx, last_candidate, weight=0.5)
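Alternatively, networkx ships this exact projection: bipartite.weighted_projected_graph weights each candidate pair by the number of shared neighbors. A sketch, assuming df is the raw donator/candidate edge list and that donator and candidate identifiers never collide:

import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
B.add_edges_from(zip(df['donator'], df['candidate']))  # bipartite donator-candidate graph
candidates = set(df['candidate'])
G = bipartite.weighted_projected_graph(B, candidates)  # weight = number of shared donators

Because the projection only walks neighbors-of-neighbors, it never touches the ~83M candidate pairs that share no donator.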

Variant of Dijkstra - no repeat groups

I'm trying to write an optimization process based on Dijkstra's algorithm to find the optimal path, but with a slight variation to disallow choosing items from the same group/family when finding the optimal path.
Brute force traversal of all edges to find the solution would be NP-hard, which is why I am attempting to (hopefully) use Dijkstra's algorithm, but I'm struggling to add in the no-repeat-groups logic.
Think of it like a traveling salesman problem, but I want to travel from New York to Los Angeles, have an interesting route (never visiting 2 similar cities from the same group), and minimize my fuel costs. There are approx 15 days and 40 cities, but for defining my program I've pared it down to 4 cities and 3 days.
Valid paths don't have to visit every group; they just can't visit 2 cities in the same group. {XL,L,S} is a valid solution, but {XL,L,XL} is not valid because it visits the XL group twice. All valid solutions will be the same length (15 days or edges) but can use any combination of groups (without duplicating groups) and need not use them all (since there are 15 days but 40 different city groups).
Here's a picture I put together to illustrate a valid & invalid route: (FYI - groups are horizontal rows in the matrix)
**Day 1**
G1->G2 # $10
G3->G4 # $30
etc...
**Day 2**
G1->G3 # $50
G2->G4 # $10
etc...
**Day 3**
G1->G4 # $30
G2->G3 # $50
etc...
The optimal path would be G1->G2->G3; however, a standard Dijkstra solution returns G1-
I found & tweaked this example code online, and I name my nodes with the following syntax so I can quickly check what day & group they belong to: D[day#][Group#], by slicing the 3rd character.
## Based on code found here: https://raw.githubusercontent.com/nvictus/priority-queue-dictionary/0eea25fa0b0981558aa780ec5b74649af83f441a/examples/dijkstra.py
import pqdict

def dijkstra(graph, source, target=None):
    """
    Computes the shortest paths from a source vertex to every other vertex in
    a graph
    """
    # The entire main loop is O( (m+n) log n ), where n is the number of
    # vertices and m is the number of edges. If the graph is connected
    # (i.e. the graph is in one piece), m normally dominates over n, making the
    # algorithm O(m log n) overall.
    dist = {}
    pred = {}
    predGroups = {}
    # Store distance scores in a priority queue dictionary
    pq = pqdict.PQDict()
    for node in graph:
        if node == source:
            pq[node] = 0
        else:
            pq[node] = float('inf')
    # Remove the head node of the "frontier" edge from pqdict: O(log n).
    for node, min_dist in pq.iteritems():
        # Each node in the graph gets processed just once.
        # Overall this is O(n log n).
        dist[node] = min_dist
        if node == target:
            break
        # Updating the score of any edge's node is O(log n) using pqdict.
        # There is _at most_ one score update for each _edge_ in the graph.
        # Overall this is O(m log n).
        for neighbor in graph[node]:
            if neighbor in pq:
                new_score = dist[node] + graph[node][neighbor]
                # This is my attempt at tracking if we've already used a node in this group/family.
                # The group designator is stored as the 3rd char in the node name for quick access.
                try:
                    groupToAdd = node[2]
                    alreadyVisited = predGroups.get(groupToAdd, False)
                except:
                    alreadyVisited = False
                    groupToAdd = 'S'
                # Solves OK with this line
                if new_score < pq[neighbor]:
                # Errors out with this line version
                #if new_score < pq[neighbor] and not(alreadyVisited):
                    pq[neighbor] = new_score
                    pred[neighbor] = node
                    # Store this node in the "visited" list to prevent future duplication
                    predGroups[groupToAdd] = groupToAdd
                    print predGroups
                    #print node[2]
    return dist, pred

def shortest_path(graph, source, target):
    dist, pred = dijkstra(graph, source, target)
    end = target
    path = [end]
    while end != source:
        end = pred[end]
        path.append(end)
    path.reverse()
    return path

if __name__ == '__main__':
    # A simple edge-labeled graph using a dict of dicts
    graph = {'START': {'D11': 1, 'D12': 50, 'D13': 3, 'D14': 50},
             'D11': {'D21': 5},
             'D12': {'D22': 1},
             'D13': {'D23': 50},
             'D14': {'D24': 50},
             'D21': {'D31': 3},
             'D22': {'D32': 5},
             'D23': {'D33': 50},
             'D24': {'D34': 50},
             'D31': {'END': 3},
             'D32': {'END': 5},
             'D33': {'END': 50},
             'D34': {'END': 50},
             'END': {'END': 0}}
    dist, path = dijkstra(graph, source='START')
    print dist
    print path
    print shortest_path(graph, 'START', 'END')
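For what it's worth, the commented-out condition misbehaves because predGroups is shared across the entire search, while the no-repeat constraint is a property of each partial path. One standard fix (my sketch, not from the post) is to fold the set of used groups into the search state, so Dijkstra runs over (node, groups) pairs:

import heapq
import itertools

def dijkstra_no_repeat_groups(graph, source, target):
    counter = itertools.count()  # tiebreaker so heapq never compares frozensets
    heap = [(0, next(counter), source, frozenset())]
    best = {(source, frozenset()): 0}  # (node, groups) -> cheapest cost found
    pred = {}                          # (node, groups) -> predecessor state
    while heap:
        cost, _, node, groups = heapq.heappop(heap)
        if cost > best.get((node, groups), float('inf')):
            continue  # stale heap entry
        if node == target:
            path, state = [node], (node, groups)
            while state in pred:
                state = pred[state]
                path.append(state[0])
            return cost, path[::-1]
        for neigh, weight in graph.get(node, {}).items():
            # group designator is the 3rd char of 'Dxy' names, per the post
            group = neigh[2] if neigh.startswith('D') else None
            if group is not None and group in groups:
                continue  # would visit the same group twice
            new_groups = groups | {group} if group is not None else groups
            state = (neigh, new_groups)
            new_cost = cost + weight
            if new_cost < best.get(state, float('inf')):
                best[state] = new_cost
                pred[state] = (node, groups)
                heapq.heappush(heap, (new_cost, next(counter), neigh, new_groups))
    return float('inf'), None

Note that in the sample dict above every Dxy node links only to D(x+1)y, i.e. every listed edge stays within a single group, so the constrained search would correctly report no valid route on that particular graph; with cross-group edges it returns the cheapest group-distinct path. The state space grows with the number of distinct group sets, which is manageable for 15 days but worth watching with 40 groups.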

My A Star implementation won't return the list of steps to get to a destination

I'll try to be brief here. I'm trying to implement A Star on Python, but obviously I'm doing something wrong, because when I test it, it doesn't return the list of steps to get to the destination.
Basically, the context is: I have a map, represented as a graph formed by nodes. I have a Player class, a Node class, and a Graph class. That doesn't matter much, but might be necessary. The player has to get to the nearest node with a Coin in it, which is also a class.
My implementation is based on the Wikipedia pseudocode, but for some reason it won't work. I'm almost completely sure that my mistake is in A Star, but I can't find it. Here I'll put the two functions that I made regarding A Star. Hope it's not too messy; I'm just starting with programming and I like commenting a lot.
I would really appreciate any help to find the problem :)
Note: I'm not an English speaker, so I'm sorry for my mistakes. I hope that in a few years I'll be able to communicate better.
import heapq  # needed for the heap operations below

def A_Star(player, graph, array_of_available_coins):
    # Define the initial position and the last position, where the coin is
    initial_position = player.position  # Player is a class. Position is of type Node
    final_position = closest_cpin(player, graph, array_of_available_coins)
    # Define the open_set, closed_set, and work with a Heap.
    open_set = [initial_position]  # open_set is initialized with the current position of the player
    closed_set = []
    heapq.heapify(open_set)  # Converts the open_set into a Python Heap (or Priority Queue)
    came_from = {}  # A dictionary where each key is a node, and the value is the previous node in the path
    # Modify G and H, and therefore F, of the initial position. G of the initial position is 0,
    # and H of the initial position is the Pythagorean distance.
    initial_position.modify_g_and_h(0, initial_position.distance(final_position))
    while open_set != []:
        square = heapq.heappop(open_set)  # Gets the least value of the open_set
        if square.is_wall():  # If it's a wall, the player can't move over it.
            continue
        if square == final_position:
            movements = []  # Creates an empty array to save the movements
            rebuild_path(came_from, square, movements)  # Calls the function to rebuild the path
            player.add_movements_array(movements)  # Copies the movements into the movements array of the player
            return
        # At this point, the square is not a wall and it's not the final_position
        closed_set.append(square)  # Add the square into the closed_set
        neighbours = graph.see_neighbours(square)  # Checks all the neighbours of the current square
        for neigh in neighbours:
            if neigh.is_wall() == True:
                continue
            if neigh in closed_set:
                continue
            # Calculates the new G, H and F values
            g_aux = square.see_g() + square.get_cost(neigh)  # Current g + the cost to get from current to neighbour
            h_aux = neigh.distance(final_position)  # Pythagorean distance between the neighbour and the final position
            f_aux = g_aux + h_aux  # F = G + H
            if neigh not in open_set:
                heapq.heappush(open_set, neigh)  # Adds the neigh into the open_set
                is_better = True
            elif f_aux < neigh.see_f():
                is_better = True
            else:
                is_better = False
            if is_better == True:
                came_from[neigh] = square  # The actual neigh came from the actual square
                neigh.modify_g_and_h(g_aux, h_aux)  # Modifies the g and h values of the actual neighbour
    return None

def rebuild_path(came_from, square, array_of_movements):
    array_of_movements.insert(0, square)  # Adds, in the first position of the array, the square it gets by parameter
    if square not in came_from:  # If there is no key "square" in came_from, then it's the first position
        array_of_movements.remove(array_of_movements[0])  # Removes the first element (because I need it that way later)
        return array_of_movements
    rebuild_path(came_from, came_from[square], array_of_movements)
    return
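One thing worth checking (a guess from the code shown, not a confirmed diagnosis): heapq on a list of bare Node objects only pops the lowest-f node if Node defines ordering by its f value, and calling modify_g_and_h on a node that is already inside the heap does not re-sort the heap, so heappop can return stale entries. The usual workaround is to store (f, tiebreaker, node) tuples and push a fresh tuple whenever f improves:

import heapq
import itertools

counter = itertools.count()  # tiebreaker so equal-f entries never compare Nodes
open_heap = []

def push(node, f):
    heapq.heappush(open_heap, (f, next(counter), node))

def pop_lowest():
    f, _, node = heapq.heappop(open_heap)  # guaranteed lowest f first
    return node

With duplicates allowed in the heap, skip a popped node if it is already in closed_set before processing it.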
The thing is, I have to implement the algorithm myself, because it's part of an exercise (a much larger one, with Pygame and everything), and this is the only thing that's making me nervous. If I use a library, it'll count as if I didn't do it, so I'd have to deliver it again :(
I would recommend networkx (import networkx); it can do this kind of stuff:
#!/usr/bin/env python
# encoding: utf-8
"""
Example of creating a block model using the blockmodel function in NX. Data used is the Hartford, CT drug users network:

@article{,
    title = {Social Networks of Drug Users in {High-Risk} Sites: Finding the Connections},
    volume = {6},
    shorttitle = {Social Networks of Drug Users in {High-Risk} Sites},
    url = {http://dx.doi.org/10.1023/A:1015457400897},
    doi = {10.1023/A:1015457400897},
    number = {2},
    journal = {{AIDS} and Behavior},
    author = {Margaret R. Weeks and Scott Clair and Stephen P. Borgatti and Kim Radda and Jean J. Schensul},
    month = jun,
    year = {2002},
    pages = {193--206}
}
"""
__author__ = """\n""".join(['Drew Conway <drew.conway@nyu.edu>',
                            'Aric Hagberg <hagberg@lanl.gov>'])

from collections import defaultdict
import networkx as nx
import numpy
from scipy.cluster import hierarchy
from scipy.spatial import distance
import matplotlib.pyplot as plt

def create_hc(G):
    """Creates hierarchical cluster of graph G from distance matrix"""
    path_length = nx.all_pairs_shortest_path_length(G)
    distances = numpy.zeros((len(G), len(G)))
    for u, p in path_length.items():
        for v, d in p.items():
            distances[u][v] = d
    # Create hierarchical cluster
    Y = distance.squareform(distances)
    Z = hierarchy.complete(Y)  # Creates HC using farthest point linkage
    # This partition selection is arbitrary, for illustrative purposes
    membership = list(hierarchy.fcluster(Z, t=1.15))
    # Create collection of lists for blockmodel
    partition = defaultdict(list)
    for n, p in zip(list(range(len(G))), membership):
        partition[p].append(n)
    return list(partition.values())

if __name__ == '__main__':
    G = nx.read_edgelist("hartford_drug.edgelist")
    # Extract largest connected component into graph H
    H = nx.connected_component_subgraphs(G)[0]
    # Makes life easier to have consecutively labeled integer nodes
    H = nx.convert_node_labels_to_integers(H)
    # Create partitions with hierarchical clustering
    partitions = create_hc(H)
    # Build blockmodel graph
    BM = nx.blockmodel(H, partitions)
    # Draw original graph
    pos = nx.spring_layout(H, iterations=100)
    fig = plt.figure(1, figsize=(6, 10))
    ax = fig.add_subplot(211)
    nx.draw(H, pos, with_labels=False, node_size=10)
    plt.xlim(0, 1)
    plt.ylim(0, 1)
    # Draw block model with weighted edges and nodes sized by number of internal nodes
    node_size = [BM.node[x]['nnodes']*10 for x in BM.nodes()]
    edge_width = [(2*d['weight']) for (u, v, d) in BM.edges(data=True)]
    # Set positions to mean of positions of internal nodes from original graph
    posBM = {}
    for n in BM:
        xy = numpy.array([pos[u] for u in BM.node[n]['graph']])
        posBM[n] = xy.mean(axis=0)
    ax = fig.add_subplot(212)
    nx.draw(BM, posBM, node_size=node_size, width=edge_width, with_labels=False)
    plt.xlim(0, 1)
    plt.ylim(0, 1)
    plt.axis('off')
    plt.savefig('hartford_drug_block_model.png')
