OSMnx graph to distance matrix and DBSCAN - Python

I'm having trouble computing a distance matrix from a graph of coordinates (lat, lon).
I want to take an arbitrary group of points (say, 200,000 firms), snap each one to its nearest node in the graph created with ox.graph_from_place().
I am working with dask arrays and dataframes (da.array, dd.DataFrame).
if __name__ == "__main__":

    # OPTION B: use a strongly (instead of weakly) connected graph
    Gs = ox.utils_graph.get_largest_component(G, strongly=True)
    Gs.__name__ = "Gs"

    # attach the nearest network node to each firm
    df["nn"] = da.array(ox.get_nearest_nodes(Gs, X=df['longitude'], Y=df['latitude'], method='balltree'))

    # we'll get distances for each pair of nodes that have firms attached to them
    nodes_unique = pd.Series(df['nn'].unique())
    nodes_unique.index = nodes_unique.values

    # convert MultiDiGraph to DiGraph for simpler, faster distance-matrix computation
    G_dm = nx.DiGraph(Gs)
    G_dm.__name__ = "G_dm"

    save_pickle(Gs)
    save_pickle(G_dm)

    print("len df['nn']:", len(df['nn']))
    print("len nodes_unique:", len(nodes_unique))
Some code has been omitted to get to the heart of the matter. I've tried following https://networkx.org/documentation/stable/reference/algorithms/shortest_paths.html and wrote a network_distance_matrix() function using the strongly connected graph Gs, but it is far too slow. The docs list plenty of shortest-path functions, but I haven't found one that efficiently computes a distance matrix between pairs of unique nodes in the graph, following paths through it.
I would like to know whether there is any way to parallelize this process, and/or to compute it in a streaming/generator fashion so it doesn't hold everything in RAM.
My objective is to feed a precomputed matrix to a DBSCAN (sklearn.cluster) model, and it has to be fast enough that I can roughly "grid-search" through parameters. I am a beginner with these libraries.
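For concreteness, here is a rough sketch of the kind of pipeline I am aiming for (the cutoff, eps and min_samples values are placeholders, and one single-source Dijkstra per unique node is just the naive way I know to fill the matrix, not necessarily the fastest):

import numpy as np
import networkx as nx
from sklearn.cluster import DBSCAN

nodes = list(nodes_unique)
idx = {n: i for i, n in enumerate(nodes)}
D = np.full((len(nodes), len(nodes)), np.inf)

for n in nodes:
    # shortest-path lengths (metres) from n, cut off to keep each search cheaper
    lengths = nx.single_source_dijkstra_path_length(G_dm, n, cutoff=5000, weight='length')
    for m, d in lengths.items():
        j = idx.get(m)
        if j is not None:
            D[idx[n], j] = d

# symmetrize and replace unreachable pairs with a large finite value (crude placeholder)
D = np.minimum(D, D.T)
D[np.isinf(D)] = D[np.isfinite(D)].max() * 10

labels = DBSCAN(eps=1000, min_samples=5, metric='precomputed').fit_predict(D)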

Related

Creating a directed scale-free graph with row-stochastic adjacency matrix using Networkx

As part of my dissertation in the field of behavioural economics, I started to work with social networks and opinion dynamics.
For a simulation-based study, I currently require a directed scale-free network featuring a row-stochastic adjacency matrix in order to perform calculations with the Degroot opinion dynamics model.
The aim is to generate a directed scale-free network in which the nodes with the highest out-degree affect many other agents within their network hub but are themselves only slightly influenced (incoming weights remain > 0, since I need a positive sum in the corresponding row of the adjacency matrix).
You can think of the network as a stylized Twitter network, where a few highly connected nodes affect many other nodes but are barely influenced by others themselves.
The problem is that after normalization the network is no longer recognized as directed. My code follows.
In a first step, I used the Networkx package to generate a scale-free graph and converted the graph object into an adjacency matrix:
G = nx.scale_free_graph(100)
nx.is_directed(G)
Output: True
Subsequently, I normalised the underlying adjacency matrix (i.e., made it row-stochastic) and converted it back into a graph.
A = nx.to_numpy_array(G)
A_normalized = normalize(A, axis=1, norm='l1')
G_new = nx.from_numpy_matrix(A_normalized)
nx.is_directed(G_new)
Output: False
Can someone explain to me why this is the case or what I can change to make my normalised network count as directed again?
The directed scale-free network is a MultiDiGraph. All you need to do is use the create_using parameter when creating the new graph from the NumPy array:
import networkx as nx
from sklearn.preprocessing import normalize

G = nx.scale_free_graph(100)
A = nx.to_numpy_array(G)
A_normalized = normalize(A, axis=1, norm='l1')
G_new = nx.from_numpy_array(A_normalized, create_using=nx.MultiDiGraph)
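As a quick sanity check, you can verify that the new graph is still directed and that its adjacency matrix is row-stochastic:

row_sums = nx.to_numpy_array(G_new).sum(axis=1)
print(nx.is_directed(G_new))   # True
print(row_sums.round(6))       # 1.0 for nodes with outgoing edges, 0.0 for any without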

Retrieving MST for geographic coordinates using scipy minimum spanning tree

I am trying to create a minimum spanning tree (MST) from geographical coordinates using scipy, but for the life of me I cannot understand how to extract information from it. The scipy documentation is not very clear, and multiple searches have not turned up results.
For context, I have around 200k data points per set in total, and they look like this
My final objective is to create a line vector that connects these points through the MST, more or less as they appear in the image above. But for that I need an ordered list of point indices (or coordinates) I can work with.
Most of all, I need help understanding how to use the output of minimum_spanning_tree, but it might be that I am making mistakes along the way.
Overall steps
The steps I take are:
Create the sparse matrix with coordinate info
Provide the matrix to scipy.sparse.csgraph.minimum_spanning_tree
Do some magic to extract column values
This is the small sample test data:
import numpy as np
import pandas as pd

test_data = {
    "index": [0, 1, 2, 3, 4],
    "X": [35, 36, 37, 38, 38],
    "Y": [2113, 2113, 2112, 2101, 2102],
}
df = pd.DataFrame(test_data)
Step 1, create the sparse matrix
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

xs = df[["X"]].values.squeeze().astype(int)
ys = df[["Y"]].values.squeeze().astype(int)
data = np.array(df.index).squeeze().astype(int)
max_dim = max(np.max(xs), np.max(ys)) + 1
max_dim
dist_matr = csr_matrix((data, (xs, ys)), shape=(max_dim, max_dim))
Q1: I couldn't understand what data means in this context, as the scipy docs do not explain it in detail. Should data be the labels of the points, or should it be the edge weights?
Step 2: calculate the minimum spanning tree
mst = minimum_spanning_tree(dist_matr)
Step 3: get an ordered list of indices (or coordinates)
As I understand it, the output of minimum_spanning_tree is a sparse graph that should look something like this (source)
Q2: However, my matrix is not 5x5 but max_value x max_value (2113 in this case), and it seems like the contents of the matrix are not the edge weights. Am I getting this wrong?
I have tried to extract the connected components, but the labels don't make sense to me
# Label connected components.
num_graphs, labels = connected_components(mst, directed=False)
# This is a snippet I found somewhere but I have difficulties following the logic of it
results = [[] for i in range(max(labels) + 1)]
for idx, label in enumerate(labels):
    results[label].append(idx)
A portion of the results:
As you can see, the point coordinates are grouped in an odd way, with no apparent relationship between x and y. I have also tried depth_first_order, but aside from requiring a starting point (which I wouldn't know how to choose), it gives me equally confusing output.
Q4: How do I "read" the MST matrix and extract the minimum spanning tree for all points?
I am happy to explore other solutions as long as they give a similar result and are scalable; however, I have seen concerns about NetworkX's performance on large datasets, and MisTree doesn't install on my setup.
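For reference, here is a rough sketch of what I think the intended usage might be on the small sample (pairwise Euclidean distances as the edge weights, rather than my csr_matrix construction above; I'm not certain this is the right way to build the input):

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

coords = df[["X", "Y"]].values

# dense pairwise Euclidean distance matrix: one row/column per point (5x5 here)
dist_full = squareform(pdist(coords))

mst = minimum_spanning_tree(dist_full)

# the MST comes back as a sparse matrix; its stored entries are the kept edges
mst_coo = mst.tocoo()
edges = sorted(zip(mst_coo.row, mst_coo.col, mst_coo.data))
print(edges)   # (point_i, point_j, distance) for each of the n-1 MST edges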

Plot a matrix as a single point in space

I have a dataset of drugs represented as a graph, each of which is described by three non-square matrices:
edge index (A), a 2×e matrix, where e is the number of bonds in the molecule; the first row indicates the node (atom) from which each edge (bond) starts, and the second row the node where it arrives;
node feature matrix (X), an n×9 matrix, where n is the number of atoms in the molecule and 9 is the number of features used to describe them (e.g. atomic number, charge, hybridization);
edge feature matrix (E), a 4×e matrix, where e is the number of bonds in the molecule and 4 is the number of features used to describe them (e.g. bond type, geometry).
I would like to plot these data in a Cartesian space to see whether clusters form based on their activity label. I thought that if I could reduce each of the three matrices to a single value per graph, I would have x, y, z coordinates, and then it would be very easy to plot the points. Does this make sense in your opinion? How could I go about turning a matrix into a single point using Python? Finally, I leave you with an example of the kind of plot I would like to create.
Thank you all!
Assuming:
The nodes in a drug's graph represent features that every drug has to different extents, including zero.
The structure of a drug's graph models the extent to which each feature applies to that drug.
There is an algorithm to calculate, from a drug's graph, the "extent" (a number) to which each feature applies to the drug.
Then:
Construct a table where each row models a drug and each column a feature. Each cell then contains the "extent" to which the column's feature applies to the row's drug.
Apply the K-Means algorithm to the table (a sketch follows below).
The challenge is, of course, the algorithm that calculates from a drug's graph the "extent" (a number) of each feature.
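A minimal sketch of that table-plus-K-Means step, assuming the per-drug feature "extents" have already been computed into a plain 2-D array (the values and cluster count below are made-up placeholders):

import numpy as np
from sklearn.cluster import KMeans

# hypothetical table: one row per drug, one column per feature "extent"
extents = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.7],
    [0.0, 0.8, 0.9],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(extents)
print(kmeans.labels_)   # cluster assignment per drug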
IMHO the first step is to enter your data into a graph theory library. I see you are using Python; Python folks generally use a library called networkx. Are you familiar with this library?
Personally, I much prefer to work with C++ (it gives the performance required for large problem sets). Recently, I added a SMILES parser to my C++ graph library.
Convert the SMILES representation of each drug to its graph representation
Calculate the graph edit distance (GED, https://en.wikipedia.org/wiki/Graph_edit_distance) between every pair of drugs
LOOP GEDMAX from 1 to 10
    Add a connection between two drugs if the GED is less than GEDMAX. This forms a new graph we can call "GEDgraph"
    Find the components (clusters of drugs all reachable from each other in the GEDgraph)
SELECT "best" set of components

Efficiently generating random graphs with a user-specified global clustering coefficient

I'm working on simulations of large-scale neuronal networks, for which I need to generate random graphs that represent the network topology.
I'd like to be able to specify the following properties of these graphs:
Number of nodes, N (~=1000-10000)
Average probability of a connection between any two given nodes, p (~0.01-0.2)
Global clustering coefficient, C (~0.1-0.5)
Ideally, the random graphs should be drawn uniformly from the set of all possible graphs that satisfy these user-specified criteria.
At the moment I'm using a very crude random diffusion approach: I start out with an Erdos-Renyi random network of the desired size and global connection probability, then on each step I randomly rewire some fraction of the edges. If the rewiring gets me closer to the desired C, I keep the rewired network for the next iteration.
Here's my current Python implementation:
import igraph
import numpy as np

def generate_fixed_gcc(n, p, target_gcc, tol=1E-3):
    """
    Creates an Erdos-Renyi random graph of size n with a specified global
    connection probability p, which is then iteratively rewired in order to
    achieve a user-specified global clustering coefficient.
    """
    # initialize random graph
    G_best = igraph.Graph.Erdos_Renyi(n=n, p=p, directed=True, loops=False)
    loss_best = 1.
    n_edges = G_best.ecount()
    # start with a high rewiring rate
    rewiring_rate = n_edges
    n_iter = 0
    while loss_best > tol:
        # operate on a copy of the current best graph
        G = G_best.copy()
        # adjust the number of connections to rewire according to the current
        # best loss
        n_rewire = min(max(int(rewiring_rate * loss_best), 1), n_edges)
        G.rewire(n=n_rewire)
        # compute the global clustering coefficient
        gcc = G.transitivity_undirected()
        loss = abs(gcc - target_gcc)
        # did we improve?
        if loss < loss_best:
            # keep the new graph
            G_best = G
            loss_best = loss
            gcc_best = gcc
            # increase the rewiring rate
            rewiring_rate *= 1.1
        else:
            # reduce the rewiring rate
            rewiring_rate *= 0.9
        n_iter += 1
    # get adjacency matrix as a boolean numpy array
    M = np.array(G_best.get_adjacency().data, dtype=bool)
    return M, n_iter, gcc_best
This works OK for small networks (N < 500), but it quickly becomes intractable as the number of nodes increases: it takes on the order of 20 s to generate a 200-node graph, and several days to generate a 1000-node graph.
Can anyone suggest an efficient way to do this?
You are right, that is a very expensive way to achieve what you want. I can only speculate whether there is a mathematically sound way to optimize it and ensure the result is close to a uniform distribution. I'm not even sure your method yields a uniform distribution, although it seems like it would. Let me try:
Based on the docs for transitivity_undirected and wikipedia Clustering Coefficient, it sounds like it is possible to make changes locally in the graph and at the same time know the exact effect on global connectivity and global clustering.
The global clustering coefficient is based on triplets of nodes. A triplet consists of three nodes that are connected by either two (open triplet) or three (closed triplet) undirected ties. A triangle consists of three closed triplets, one centred on each of the nodes. The global clustering coefficient is the number of closed triplets (or 3 x triangles) over the total number of triplets (both open and closed).
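To make that definition concrete, here is a small check (using networkx rather than igraph, purely for illustration) that computing "closed triplets over all triplets" from the triangle counts and the degree sequence agrees with the library's transitivity:

import networkx as nx

G = nx.erdos_renyi_graph(200, 0.05, seed=0)

# closed triplets: each triangle contributes 3 (one centred on each of its nodes)
closed_triplets = sum(nx.triangles(G).values())

# all triplets: a node of degree d is the centre of d*(d-1)/2 paths of length 2
all_triplets = sum(d * (d - 1) // 2 for _, d in G.degree())

print(closed_triplets / all_triplets)   # same value as below
print(nx.transitivity(G))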
(Edit) Based on my reading of the paper referenced by ali_m, the method below will probably spend too many edges on low-degree clusters, leading to a graph that cannot achieve the desired clustering coefficient unless it is very low (which probably wouldn't be useful anyway). Therefore, on the off chance that somebody actually uses this, you will want to identify higher-degree clusters to add edges to, in order to quickly raise the clustering coefficient without needing to add a lot of edges.
On the other hand, the method below does align with the methods in the research paper so it's more or less a reasonable approach.
If I understand it correctly, you could do the following:
Produce the graph as you have done.
Calculate and Track:
p_surplus to track the number of edges that need to be added or removed elsewhere to maintain connectivity
cc_top, cc_btm to track the clustering coefficient
Iteratively (but not exhaustively) choose random pairs and connect or disconnect them to monotonically approach the clustering coefficient (cc) you want while maintaining the connection probability (p) you already have.
Pseudo code:
for random_pair in random_pairs:
    if (random_pair is connected) and (need to reduce cc or p):  # maybe prioritize whichever has the larger gap?
        delete the edge
        p_surplus -= 1
        cc_top -= broken_connected_triplets   # have to search locally
        cc_btm -= (broken_connected_triplets + broken_open_triplets)  # have to search locally
    elif (random_pair is not connected) and (need to increase cc or p):
        add the edge
        p_surplus += 1
        cc_top += new_connected_triplets
        cc_btm += (new_connected_triplets + new_open_triplets)
    if cc and p are within desired ranges:
        done
    if some condition for detecting infinite loops:
        rethink this method
That may not be totally correct, but I think the approach will work. The efficiency of searching for local triplets and always moving your parameters in the right direction will be better than copying the graph and globally measuring the cc so many times.
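A rough sketch of that bookkeeping in Python (undirected, using networkx rather than igraph, with placeholder targets; it omits the p_surplus accounting, is not guaranteed to converge, and is only meant to show how cc_top and cc_btm can be updated locally when a single edge is toggled):

import random
import networkx as nx

def incremental_gcc_rewire(G, target_cc, max_steps=100000, tol=1e-3, seed=0):
    """Nudge the global clustering coefficient of an undirected graph toward
    target_cc by adding/removing single edges, updating the triplet counts
    locally instead of recomputing transitivity from scratch each time."""
    rng = random.Random(seed)
    nodes = list(G.nodes())

    # closed triplets (3 * number of triangles) and all triplets (sum of C(d, 2))
    cc_top = sum(nx.triangles(G).values())
    cc_btm = sum(d * (d - 1) // 2 for _, d in G.degree())

    for _ in range(max_steps):
        cc = cc_top / cc_btm if cc_btm else 0.0
        if abs(cc - target_cc) <= tol:
            break
        u, v = rng.sample(nodes, 2)
        common = len(list(nx.common_neighbors(G, u, v)))
        if G.has_edge(u, v) and cc > target_cc:
            # removing (u, v) destroys `common` triangles and (d(u)-1)+(d(v)-1) triplets
            cc_top -= 3 * common
            cc_btm -= (G.degree(u) - 1) + (G.degree(v) - 1)
            G.remove_edge(u, v)
        elif not G.has_edge(u, v) and cc < target_cc:
            # adding (u, v) creates `common` triangles and d(u)+d(v) new triplets
            cc_top += 3 * common
            cc_btm += G.degree(u) + G.degree(v)
            G.add_edge(u, v)
    return G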
Having done a bit of reading, it looks as though the best solution might be the generalized version of Gleeson's algorithm presented in this paper. However, I still don't really understand how to implement it, so for the time being I've been working on Bansal et al.'s algorithm.
Like my naive approach, this is a Markov chain-based method that uses random edge swaps, but unlike mine it specifically targets 'triplet motifs' within the graph for rewiring:
Since this has a greater tendency to introduce triangles, it has a greater impact on the clustering coefficient. At least in the case of undirected graphs, the rewiring step is also guaranteed to preserve the degree sequence. As before, on every rewiring iteration the new global clustering coefficient is measured, and the new graph is accepted if the GCC got closer to the target value.
Bansal et al actually provided a Python implementation, but for various reasons I ended up writing my own version, which you can find here.
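For flavour, here is a generic accept/reject rewiring skeleton in the same spirit, using networkx's plain double_edge_swap (which preserves the degree sequence of an undirected graph); it does not do the motif-targeted swap selection of Bansal et al., so it converges more slowly, and all parameters below are placeholders:

import networkx as nx

def rewire_to_target_gcc(n, p, target_gcc, tol=1e-3, max_iter=20000, seed=0):
    """Accept/reject degree-preserving edge swaps until the global clustering
    coefficient is within tol of target_gcc (or max_iter is reached)."""
    G_best = nx.erdos_renyi_graph(n, p, seed=seed)
    loss_best = abs(nx.transitivity(G_best) - target_gcc)
    for _ in range(max_iter):
        if loss_best <= tol:
            break
        G = G_best.copy()
        nx.double_edge_swap(G, nswap=10, max_tries=1000)
        loss = abs(nx.transitivity(G) - target_gcc)
        if loss < loss_best:   # accept only if we moved toward the target
            G_best, loss_best = G, loss
    return G_best

G = rewire_to_target_gcc(300, 0.05, target_gcc=0.1)
print(nx.transitivity(G))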
Performance
The Bansal approach takes just over half the number of iterations and half the total time compared with my naive diffusion method:
I was hoping for bigger gains, but a 2x speedup is better than nothing.
Generalizing to directed graphs
One remaining challenge with the Bansal method is that my graphs are directed, whereas Bansal et al's algorithm is only designed to work on undirected graphs. With a directed graph, the rewiring step is no longer guaranteed to preserve the in- and out-degree sequences.
Update
I've just figured out how to generalize the Bansal method to preserve both the in- and out-degree sequences for directed graphs. The trick is to select motifs where the two outward edges to be swapped have opposite directions (the directions of the edges between {x, y1} and {x, y2} don't matter):
I've also made some more optimizations, and the performance is starting to look a bit more respectable - it takes roughly half the number of iterations and half the total time compared with the diffusion approach. I've updated the graphs above with the new timings.
I came up with a graph generation model that can easily generate connected random graphs of some 10,000 nodes and more that follow prescribed degree and (local) clustering coefficient distributions which can be chosen such that any desired global clustering coefficient results. You can find a short description here. By the way, you will find your question (this one) in the references.
Kolda et al. proposed the BTER model (2013), which can generate random graphs with prescribed degree and clustering-coefficient distributions (and thus a prescribed global clustering coefficient). It seems a bit more complicated than my model (see above), but maybe it's faster or generates less biased graphs. (To be honest, though, I assume my model doesn't generate severely biased graphs either, but essentially random ones.)

Clustering structural 3D data

Let's say I have a number of objects (similar to proteins, but not exactly), each of which is represented by a vector of n 3D coordinates. Each of these objects is oriented somewhere in space. Their similarity can be calculated by aligning them using the Kabsch Algorithm and calculating the root-mean-square deviation of the aligned coordinates.
My question is: what would be the recommended way of clustering a large set of these structures so as to extract the most populated cluster (i.e. the one to which the most structures belong)? And is there a way of doing this in Python? By way of example, here's a trivialized set of unclustered structures (each is represented by the coordinates of its four vertices):
And then the desired clustering (using two clusters):
I've tried aligning all of the structures to a reference structure (i.e. the first structure) and then performing k-means on the distances between the reference and the aligned coordinates using Pycluster.kcluster, but this seems a little clumsy and doesn't work very well; the structures in each cluster don't end up being very similar to each other. Ideally this clustering wouldn't be done on the difference vectors, but rather on the actual structures themselves, but the structures have shape (n, 3) rather than the (n,) required for k-means clustering.
The other option I tried is scipy.cluster.hierarchy. This seems to work pretty well, but I'm having trouble deciding which cluster is the most populated, since one can always find a more populated cluster by moving up to the next branch of the tree.
Any thoughts or suggestions or ideas about different (already implemented in python) clustering algorithms would be greatly appreciated.
To give an introductory answer to my own question: one can use the list of distances between each pair of points in the shape as the representation to perform the clustering on.
Let's create some shapes:
import numpy as np
from matplotlib.pyplot import plot, text

shapes = np.array([[[1, 4], [4, 2], [11, 2], [14, 0]],
                   [[4, 5], [4, 2], [11, 2], [13, 0]],
                   [[1, 3], [4, 2], [11, 2], [14, 1.5]],
                   [[3, 5], [4, 2], [10, 7], [7, 9]],
                   [[5, 5], [4, 2], [10, 7], [6, 6]]])

def random_rotation():
    theta = 3 * np.pi * np.random.random()
    rotMatrix = np.array([[np.cos(theta), -np.sin(theta)],
                          [np.sin(theta),  np.cos(theta)]])
    return rotMatrix

new_shapes = []
for s in shapes:
    rr = random_rotation()
    new_shapes += [[list(rr.dot(p)) + [0] for p in s]]
new_shapes = np.array(new_shapes)

for i, s in enumerate(new_shapes):
    plot(s[:, 0], s[:, 1], color='black')
    text(np.mean(s[:, 0]), np.mean(s[:, 1]), str(i), fontsize=14)
Then we create some auxiliary functions and create an array containing all the inter-vertex distances for each shape (darray).
import itertools as it
import math

def vec_distance(v1, v2):
    '''
    The distance between two vectors.
    '''
    diff = v2 - v1
    return math.sqrt(sum(diff * diff))

def distances(s):
    '''
    Compute the pairwise inter-vertex distance array for a shape s.
    '''
    ds = [vec_distance(p1, p2) for p1, p2 in it.combinations(s, r=2)]
    return np.array(ds)

# create an array of inter-vertex distances for each shape
darray = np.array([distances(s) for s in new_shapes])
Cluster them into two clusters using Pycluster.
import Pycluster as pc

clust = pc.kcluster(darray, 2)
print(clust)
And see that we end up with three entries in the first cluster and two in the other.
(array([0, 0, 0, 1, 1], dtype=int32), 4.576996142441375, 1)
But which shapes do they correspond to?
import brewer2mpl

dark2 = brewer2mpl.get_map('Dark2', 'qualitative', 4).mpl_colors
for i, (s, c) in enumerate(zip(new_shapes, clust[0])):
    plot(s[:, 0], s[:, 1], color=dark2[c])
    text(np.mean(s[:, 0]), np.mean(s[:, 1]), str(i), fontsize=14)
Looks good! The problem is that as shapes get larger, the distance array grows quadratically with the number of vertices. I found a presentation that describes this problem and suggests some solutions (such as SVD, which I presume is used as a form of dimensionality reduction) to speed things up.
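A quick sketch of that dimensionality-reduction idea, using scikit-learn's TruncatedSVD as a stand-in for whatever the presentation had in mind (the number of components is a placeholder):

from sklearn.decomposition import TruncatedSVD
import Pycluster as pc

# project the (n_shapes, n_pairwise_distances) array onto a few components
svd = TruncatedSVD(n_components=3)
darray_reduced = svd.fit_transform(darray)

clust_reduced = pc.kcluster(darray_reduced, 2)
print(clust_reduced)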
I'm not going to accept my answer just yet because I'm interested in any other ideas or thoughts about how to approach this simple problem.
