I have a dataset of drugs, each represented as a graph and described by three non-square matrices:
edge index (A), a 2×e matrix, where e is the number of bonds in the molecule; the first row gives the node (atom) each edge (bond) starts from, and the second row gives the node it arrives at;
node feature matrix (X), an n×9 matrix, where n is the number of atoms in the molecule and 9 is the number of features used to describe them (e.g. atomic number, charge, hybridization);
edge feature matrix (E), a 4×e matrix, where e is the number of bonds in the molecule and 4 is the number of features used to describe them (e.g. bond type, geometry).
I would like to plot these data in Cartesian space to see whether clusters form according to their activity label. My thought is: if I can reduce each matrix to a single number per graph, I will have three x, y, z coordinates, and then it will be very easy to plot the points. Does this make sense in your opinion? How could I go about turning a matrix into a single value using Python? Finally, I leave you with an example of the graph I would like to create.
Thank you all!
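To make the idea concrete, here is a deliberately naive sketch of the reduction described above: collapse each matrix to one scalar (here the bond count from A and the grand means of X and E), giving one (x, y, z) point per drug, coloured by activity label. The random drugs and labels below are stand-ins for the real dataset, and this reduction throws away most of the structure, so treat it only as an illustration.

import numpy as np
import matplotlib.pyplot as plt

def graph_to_point(A, X, E):
    # A: 2 x e edge index, X: n x 9 node features, E: 4 x e edge features
    return np.array([A.shape[1],   # number of bonds
                     X.mean(),     # grand mean of node features
                     E.mean()])    # grand mean of edge features

# toy stand-ins for the real dataset: random matrices and random binary activity labels
rng = np.random.default_rng(0)
drugs = [(rng.integers(0, 10, (2, 15)), rng.random((10, 9)), rng.random((4, 15)))
         for _ in range(30)]
labels = rng.integers(0, 2, 30)

points = np.array([graph_to_point(A, X, E) for A, X, E in drugs])

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(points[:, 0], points[:, 1], points[:, 2], c=labels)
ax.set_xlabel("number of bonds"); ax.set_ylabel("mean(X)"); ax.set_zlabel("mean(E)")
plt.show()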
Assuming:
The nodes in a drug's graph represent features that every drug has to different extents, including zero.
The structure of a drug's graph models the extent to which every feature applies to that drug
There is an algorithm to calculate, from a drug's graph, the 'extent' (a number) to which each feature applies to the drug
Then:
Construct a table where each row models a drug and each column is for a feature. Each cell then contains the "extent" to which the column's feature applies to the row's drug.
Apply the K-Means algorithm to the table.
The challenge is, of course, the algorithm that calculates from a drug's graph the 'extent' (a number) to which each feature applies.
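Once that table exists, the clustering step itself is short. A minimal sketch with scikit-learn, where the random extent_table is a stand-in for the drugs-by-features table described above:

import numpy as np
from sklearn.cluster import KMeans

# extent_table: one row per drug, one column per feature (the "extent" values)
extent_table = np.random.rand(50, 9)   # placeholder data

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(extent_table)
print(kmeans.labels_)           # cluster assignment of each drug
print(kmeans.cluster_centers_)  # centroid of each cluster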
IMHO the first step is to enter your data into a graph theory library. I see you are using Python. Python folks generally use a library called networkx. Are you familiar with this library?
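If you go the networkx route, entering one drug's matrices into a graph could look roughly like this (a sketch assuming A is the 2×e edge index and X the n×9 node feature matrix from the question):

import networkx as nx
import numpy as np

def to_networkx(A, X):
    G = nx.Graph()
    # one node per atom, carrying its feature vector as a node attribute
    for i, features in enumerate(X):
        G.add_node(i, features=features)
    # one edge per bond, given by the two rows of the edge index
    for u, v in zip(A[0], A[1]):
        G.add_edge(int(u), int(v))
    return G

# example with a 3-atom toy molecule
A = np.array([[0, 1], [1, 2]])
X = np.random.rand(3, 9)
G = to_networkx(A, X)
print(G.number_of_nodes(), G.number_of_edges())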
Personally, I much prefer to work with C++ (it gives the performance required for large problem sets). Recently, I added a SMILES parser to my C++ graph library.
Convert the SMILES representation of each drug to its graph representation
Calculate the graph edit distance ( GED https://en.wikipedia.org/wiki/Graph_edit_distance ) between every pair of drugs
LOOP GEDMAX from 1 to 10
Add a connection between two drugs if the GED is less than GEDMAX. This forms a new graph we can call "GEDgraph"
Find the components ( clusters of drugs all reachable from each other in the GEDgraph )
SELECT "best" set of components
I am trying to convert an adjacency matrix into the torch_geometric.data.Data format. I am able to get the edge index list using csr_matrix.
I also wonder what I should put for x, the node feature matrix with shape [num_nodes, num_node_features]: should this be the matrix of the edge weights? It would be great to have some clarity on what the node feature columns relate to; a practical or theoretical example would be great.
Help would be appreciated.
import torch
import scipy.sparse as scpy
from torch_geometric.data import Data

# creating a tensor from the adjacency-matrix dataframe (dropping the first column)
# https://stackoverflow.com/questions/50307707/convert-pandas-dataframe-to-pytorch-tensor
torch_tensor = torch.tensor(adjacencyMat_df.iloc[:, 1:].values)
torch_tensor

# sparse (CSR) view of the adjacency matrix
A = scpy.csr_matrix(torch_tensor)
print(A)

data = Data(x=x, edge_index=edge_index)
It is as I originally thought. It potentially involves a 90-degree rotation (a transpose) of the selected features, so that each sample's results run along the columns. Each row then represents a single node.
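For reference, here is a sketch (my own, not from the post) of one way to go from a SciPy sparse adjacency matrix to a Data object, assuming PyTorch Geometric's from_scipy_sparse_matrix utility; the toy adjacency matrix and made-up node features are placeholders.

import torch
import scipy.sparse as scpy
from torch_geometric.data import Data
from torch_geometric.utils import from_scipy_sparse_matrix

# toy adjacency matrix for 3 nodes
A = scpy.csr_matrix([[0, 1, 0],
                     [1, 0, 1],
                     [0, 1, 0]])

# edge_index has shape [2, num_edges]; edge_weight holds the nonzero matrix entries
edge_index, edge_weight = from_scipy_sparse_matrix(A)

# x is [num_nodes, num_node_features]: one row per node, one column per node attribute
# (here two made-up features per node; these are NOT the edge weights)
x = torch.tensor([[1.0, 0.5],
                  [2.0, 0.1],
                  [3.0, 0.9]])

data = Data(x=x, edge_index=edge_index, edge_attr=edge_weight)
print(data)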
I have carried out a clustering analysis on some data X and have arrived at both the labels y and the centroids c. Now I'm trying to calculate the distance between each point in X and its assigned cluster's centroid. This is easy when we have a small number of points:
import numpy as np
# 10 random points in 3D space
X = np.random.rand(10,3)
# define the number of clusters, say 3
clusters = 3
# give each point a random label
# (in the real code this is found using KMeans, for example)
y = np.asarray([np.random.randint(0,clusters) for i in range(10)]).reshape(-1,1)
# randomly assign location of centroids
# (in the real code this is found using KMeans, for example)
c = np.random.rand(clusters,3)
# calculate distances
distances = []
for i in range(len(X)):
    distances.append(np.linalg.norm(X[i] - c[y[i][0]]))
Unfortunately, the actual data has many more rows. Is there a way to vectorise this somehow (instead of using a for loop)? I can't seem to get my head around the mapping.
Thanks to numpy's array indexing, you can actually turn your for loop into a one-liner and avoid explicit looping altogether:
distances = np.linalg.norm(X - np.einsum('ijk->ik', c[y]), axis=1)
will do the same thing as your original for loop.
EDIT: Thanks @Kris, I forgot the axis keyword; since I didn't specify it, numpy computed the norm of the entire flattened matrix rather than along the rows (axis 1). I've updated it now, and it should return an array of distances for each point. Also, einsum was suggested by @Kris for their specific application.
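As a side note (my addition, not part of the original answer): if the labels are flattened first, plain fancy indexing gives the same result without einsum.

# equivalent, using 1-D labels so that c[y.ravel()] already has shape (n_points, 3)
distances = np.linalg.norm(X - c[y.ravel()], axis=1)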
I'm having trouble writing code to get the distance matrix from a graph of coordinates (lat, lon).
I want to take an arbitrary group of points (say, 200,000 firms) and get their nearest representations in the graph, created with ox.graph_from_place().
I am working with dask arrays and dataframes (da.array, dd.DataFrame).
import osmnx as ox
import networkx as nx
import pandas as pd
import dask.array as da

# G, df (with 'longitude'/'latitude' columns) and save_pickle() are defined in code omitted here

if __name__ == "__main__":
    # OPTION B: Use a strongly (instead of weakly) connected graph
    Gs = ox.utils_graph.get_largest_component(G, strongly=True)
    Gs.__name__ = "Gs"

    # attach the nearest network node to each firm
    df["nn"] = da.array(ox.get_nearest_nodes(Gs, X=df['longitude'], Y=df['latitude'], method='balltree'))

    # we'll get distances for each pair of nodes that have firms attached to them
    nodes_unique = pd.Series(df['nn'].unique())
    nodes_unique.index = nodes_unique.values

    # convert MultiDiGraph to DiGraph for simpler, faster distance-matrix computation
    G_dm = nx.DiGraph(Gs)
    G_dm.__name__ = "G_dm"

    save_pickle(Gs)
    save_pickle(G_dm)

    print("len df['nn']:", len(df['nn']))
    print("len nodes_unique:", len(nodes_unique))
Some code has been omitted to get to the heart of the matter. I've tried following https://networkx.org/documentation/stable/reference/algorithms/shortest_paths.html and wrote a network_distance_matrix() function using the strongly connected graph Gs, but it is very inefficient in computation time. I have seen a bunch of functions in the docs, but I haven't found one that efficiently computes a distance matrix between pairs of unique nodes belonging to the graph, following paths through it.
I would like to know whether there is any way to parallelize this process, and/or to compute it in a streaming (generator-like) way rather than holding everything in RAM.
My objective is to provide a precomputed matrix to a DBSCAN model (sklearn.cluster), and it has to be fast so that I can more or less grid-search through parameters. I am a beginner with these libraries.
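Not part of the question, but a hedged sketch of one common approach: run one single-source Dijkstra per unique node and keep only the distances to the other unique nodes. Each source is independent, so the outer loop is trivially parallelizable (e.g. with multiprocessing or dask.delayed). It assumes G_dm from above and that edges carry a 'length' attribute, as OSMnx graphs do.

import numpy as np
import networkx as nx

def network_distance_matrix(G, nodes, weight="length"):
    """Shortest-path distances between every pair of `nodes`, following the graph."""
    nodes = list(nodes)
    index = {n: i for i, n in enumerate(nodes)}
    dm = np.full((len(nodes), len(nodes)), np.inf)
    for i, source in enumerate(nodes):
        # one Dijkstra per source; each iteration is independent and can run in parallel
        lengths = nx.single_source_dijkstra_path_length(G, source, weight=weight)
        for target, dist in lengths.items():
            j = index.get(target)
            if j is not None:
                dm[i, j] = dist
    return dm

# dm = network_distance_matrix(G_dm, nodes_unique.values)
# DBSCAN(eps=..., min_samples=..., metric="precomputed").fit(dm)  # from sklearn.cluster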
Let's say I have a number of objects (similar to proteins, but not exactly), each of which is represented by a vector of n 3D coordinates. Each of these objects is oriented somewhere in space. Their similarity can be calculated by aligning them using the Kabsch Algorithm and calculating the root-mean-square deviation of the aligned coordinates.
My question is: what would be the recommended way of clustering a large set of these structures so as to extract the most populated cluster (i.e. the one to which most structures belong)? Also, is there a way of doing this in Python? By way of example, here's a trivialized set of unclustered structures (each is represented by the coordinates of its four vertices):
And then the desired clustering (using two clusters):
I've tried aligning all of the structures to a reference structure (i.e. the first structure) and then performing k-means on the distances between the reference and the aligned coordinates using Pycluster.kcluster, but this seems a little clumsy and doesn't work very well. The structures in each cluster don't end up being very similar to each other. Ideally this clustering wouldn't be done on the difference vectors, but rather on the actual structures themselves, but the structures have shape (n, 3) rather than the (n,) required for k-means clustering.
The other option I tried is scipy.cluster.hierarchy. This seems to work pretty well, but I'm having trouble deciding which cluster is the most populated, since one can always find a more populated cluster by moving up to the next branch of the tree.
Any thoughts or suggestions or ideas about different (already implemented in python) clustering algorithms would be greatly appreciated.
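For reference, here is a minimal NumPy sketch (my own, not the code used here) of the Kabsch-plus-RMSD similarity mentioned above, assuming two (n, 3) coordinate arrays; it uses the SVD of the covariance matrix and guards against reflections.

import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (n, 3) coordinate sets after optimal superposition of P onto Q."""
    P = P - P.mean(axis=0)                  # centre both sets at the origin
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                             # covariance matrix
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # correct for an improper rotation (reflection)
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T                      # optimal rotation
    P_rot = P @ R.T
    return np.sqrt(((P_rot - Q) ** 2).sum() / len(P))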
To give an introductory answer to my own question, I'll suggest that one can use the list of pairwise distances between the points of each shape as the representation to perform the clustering on.
Let's create some shapes:
import math
import numpy as np
from matplotlib.pyplot import plot, text

shapes = np.array([[[1,4],[4,2],[11,2],[14,0]],
                   [[4,5],[4,2],[11,2],[13,0]],
                   [[1,3],[4,2],[11,2],[14,1.5]],
                   [[3,5],[4,2],[10,7],[7,9]],
                   [[5,5],[4,2],[10,7],[6,6]]])

def random_rotation():
    # random 2D rotation matrix
    theta = 3 * np.pi * np.random.random()
    rotMatrix = np.array([[np.cos(theta), -np.sin(theta)],
                          [np.sin(theta),  np.cos(theta)]])
    return rotMatrix

new_shapes = []
for s in shapes:
    rr = random_rotation()
    # rotate the 2D shape and pad it with a zero z-coordinate
    new_shapes += [[list(rr.dot(p)) + [0] for p in s]]
new_shapes = np.array(new_shapes)

for i, s in enumerate(new_shapes):
    plot(s[:,0], s[:,1], color='black')
    text(np.mean(s[:,0]), np.mean(s[:,1]), str(i), fontsize=14)
Then we define some auxiliary functions and build an array containing all the inter-vertex distances for each shape (darray).
import itertools as it

def vec_distance(v1, v2):
    '''
    The Euclidean distance between two vectors.
    '''
    diff = v2 - v1
    return math.sqrt(sum(diff * diff))

def distances(s):
    '''
    Compute the flattened array of pairwise vertex distances for a shape s.
    '''
    ds = [vec_distance(p1, p2) for p1, p2 in it.combinations(s, r=2)]
    return np.array(ds)

# create an array of inter-vertex distances for each shape
darray = np.array([distances(s) for s in new_shapes])
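As an aside (my addition): SciPy ships the same computation as a one-liner, scipy.spatial.distance.pdist, which returns the condensed vector of pairwise distances in the same order as itertools.combinations.

from scipy.spatial.distance import pdist
darray = np.array([pdist(s) for s in new_shapes])  # same shape as above: (n_shapes, n_pairs)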
Cluster them into two clusters using Pycluster.
import Pycluster as pc
clust = pc.kcluster(darray, 2)
print(clust)
And see that we end up with three entries in the first cluster and two in the other.
(array([0, 0, 0, 1, 1], dtype=int32), 4.576996142441375, 1)
But which shapes do they correspond to?
import brewer2mpl
dark2 = brewer2mpl.get_map('Dark2', 'qualitative', 4).mpl_colors
for i, (s, c) in enumerate(zip(new_shapes, clust[0])):
    plot(s[:,0], s[:,1], color=dark2[c])
    text(np.mean(s[:,0]), np.mean(s[:,1]), str(i), fontsize=14)
Looks good! The problem is that, as shapes get larger, the distance array grows quadratically with the number of vertices. I found a presentation that describes this problem and suggests some solutions (such as SVD, for what I presume is a form of dimensionality reduction) to speed it up.
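In that spirit, a small sketch (my addition, not from the presentation) of trimming the quadratic-size distance vectors with scikit-learn's PCA before clustering; the number of components here is an arbitrary choice.

from sklearn.decomposition import PCA

# reduce each shape's pairwise-distance vector to a handful of components
pca = PCA(n_components=3)
darray_reduced = pca.fit_transform(darray)
clust = pc.kcluster(darray_reduced, 2)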
I'm not going to accept my answer just yet because I'm interested in any other ideas or thoughts about how to approach this simple problem.