I have a dataset of drugs, each represented as a graph and described by three non-square matrices:
edge index (A), a 2xe matrix, where e is the number of bonds in the molecule; the first row indicates the node (atom) from which the edge (bond) starts, and the second row the node where the edge arrives;
node feature matrix (X), an nx9 matrix, where n is the number of atoms in the molecule and 9 is the number of features used to describe them (e.g. atomic number, charge, hybridization);
edge feature matrix (E), a 4xe matrix, where e is the number of bonds in the molecule and 4 is the number of features used to describe them (e.g. bond type, geometry).
I would like to plot these data in Cartesian space to see whether clusters form based on their activity label. My thought: if I can reduce each matrix to a single number per graph, I will have three coordinates (x, y, z), and then it will be very easy to plot the points. Does this make sense in your opinion? How could I go about turning a matrix into a single number using Python? Finally, I leave you with an example of the plot I would like to create.
Thank you all!
Assuming:
The nodes in a drug's graph represent features that every drug has to different extents, including zero.
The structure of a drug's graph models the extent to which every feature applies to that drug.
There is an algorithm to calculate from a drug's graph the 'extent' (a number) to which each feature applies to the drug.
Then:
Construct a table where each row models a drug and each column is for a feature. Each cell then contains the "extent" to which the column's feature applies to the row's drug.
Apply the K-Means algorithm to the table.
The challenge is, of course, the algorithm that calculates from a drug's graph the 'extent' (a number) of each feature; once that table exists, the clustering step itself is straightforward (see the sketch below).
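As a minimal sketch of the clustering step only, assuming the per-drug "extent" table has already been computed (the extent_table values below are made up for illustration):
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical table: one row per drug, one column per feature,
# each cell holding the "extent" of that feature for that drug.
extent_table = pd.DataFrame(
    {"feature_a": [0.0, 1.5, 0.2], "feature_b": [2.1, 0.0, 1.9]},
    index=["drug_1", "drug_2", "drug_3"],
)

# Cluster the drugs; n_clusters is a free parameter to tune.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(extent_table.values)
print(dict(zip(extent_table.index, labels)))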
IMHO the first step is to enter your data into a graph theory library. I see you are using Python. Python folks generally use a library called networkx. Are you familiar with this library?
Personally, I much prefer to work with C++ (it gives the performance required for large problem sets). Recently, I added a SMILES parser to my C++ graph library.
Convert the SMILES representation of each drug to its graph representation
Calculate the graph edit distance ( GED https://en.wikipedia.org/wiki/Graph_edit_distance ) between every pair of drugs
LOOP GEDMAX from 1 to 10
Add a connection between two drugs if the GED is less than GEDMAX. This forms a new graph we can call "GEDgraph"
Find the components ( clusters of drugs all reachable from each other in the GEDgraph )
SELECT "best" set of components
Related
I am trying to create a minimum spanning tree (MST) from geographical coordinates using scipy, but for the life of me I cannot understand how to extract information from it. The scipy documentation is not very clear, and multiple searches have not provided results.
For context: in total I have around 200k data points per set, and they look like this
My final objective is to create a line vector that connects these points through the MST, more or less as they appear in the image above. But for that I need an ordered list of point indices (or coordinates) I can work with.
Most of all I need help understanding how to use the output of minimum_spanning_tree, but it might be that I am making mistakes along the way.
Overall steps
The steps I take are:
Create the sparse matrix with coordinate info
Provide the matrix to scipy.sparse.csgraph.minimum_spanning_tree
Do some magic to extract column values
This is the small sample test data:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

test_data = {
    "index": [0, 1, 2, 3, 4],
    "X": [35, 36, 37, 38, 38],
    "Y": [2113, 2113, 2112, 2101, 2102]
}
df = pd.DataFrame(test_data)
Step 1, create the sparse matrix
xs = df[["X"]].values.squeeze().astype(int)
ys = df[["Y"]].values.squeeze().astype(int)
data= np.array(df.index).squeeze().astype(int)
max_dim =max(np.max(xs), np.max(ys)) +1
max_dim
dist_matr=csr_matrix((data, (xs,ys)),shape=(max_dim, max_dim))
Q1: I couldn't understand what data is in this context, as the scipy docs do not explain it in detail. Should data be the labels of the points, or are they the edge weights?
Step 2: calculate the minimum spanning tree
mst = minimum_spanning_tree(dist_matr)
Step 3: get an ordered list of indices (or coordinates)
As I understand it, the output of minimum_spanning_tree is a sparse matrix that should look something like this (source)
Q2: However, my matrix is not 5X5, but max_value*max_value (2113 in this case). And it seems like the content of the matrix is not the edge weight. Am I getting this wrong?
I have tried to extract the connected components, but the labels don't make sense to me
# Label connected components.
num_graphs, labels = connected_components(mst, directed=False)

# This is a snippet I found somewhere, but I have difficulties following its logic
results = [[] for i in range(max(labels) + 1)]
for idx, label in enumerate(labels):
    results[label].append(idx)
portion of the results:
As you can see, the point coordinates are grouped in an odd way, without a relationship between x and y. I have also tried depth_first_order, but aside from asking for a starting point (which I wouldn't know how to choose) it gives me equally confusing outputs.
Q4: How do I "read" the MST matrix and extract the minimum spanning tree for all points?
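For reference, a minimal sketch (using the small test data above, and scipy.spatial.distance_matrix as an assumed helper) of one way the output can be read: store actual pairwise distances in the input matrix, so the entries are edge weights rather than point labels, then pull the selected edges out of the returned sparse matrix with tocoo():
import numpy as np
import pandas as pd
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial import distance_matrix

df = pd.DataFrame({"X": [35, 36, 37, 38, 38], "Y": [2113, 2113, 2112, 2101, 2102]})
points = df[["X", "Y"]].values

# n x n matrix of Euclidean distances; entry (i, j) is the edge weight between points i and j.
dists = distance_matrix(points, points)

# The MST comes back as a sparse matrix: mst[i, j] != 0 means the edge (i, j) was kept.
mst = minimum_spanning_tree(dists)

# Read the kept edges (pairs of point indices plus their weights) out of the sparse result.
coo = mst.tocoo()
edges = list(zip(coo.row, coo.col, coo.data))
print(edges)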
I am happy to explore other solutions as long as they provide a similar result and are scalable; however, I have seen concerns about NetworkX for lots of data, and MisTree doesn't install on my setup.
I'm having trouble getting the distance matrix from a graph of coordinates (LAT, LON).
I want to take an arbitrary group of points (let's say 200,000 firms) and get their nearest representations in the graph created with ox.graph_from_place().
I am working with dask arrays and dataframes (da.array, df.DataFrame)
if __name__ == "__main__":
    # OPTION B: Use a strongly (instead of weakly) connected graph
    Gs = ox.utils_graph.get_largest_component(G, strongly=True)
    Gs.__name__ = "Gs"

    # attach nearest network node to each firm
    df["nn"] = da.array(ox.get_nearest_nodes(Gs, X=df['longitude'], Y=df['latitude'], method='balltree'))

    # we'll get distances for each pair of nodes that have firms attached to them
    nodes_unique = pd.Series(df['nn'].unique())
    nodes_unique.index = nodes_unique.values

    # convert MultiDiGraph to DiGraph for simpler, faster distance matrix computation
    G_dm = nx.DiGraph(Gs)
    G_dm.__name__ = "G_dm"

    save_pickle(Gs)
    save_pickle(G_dm)

    print("len df['nn']:", len(df['nn']))
    print("len nodes_unique:", len(nodes_unique))
Some code has been omitted to get to the heart of the matter. I've tried following https://networkx.org/documentation/stable/reference/algorithms/shortest_paths.html and a network_distance_matrix() function using the strongly connected graph Gs, but this is very inefficient in computation time. I have seen a bunch of functions in the docs, but none that efficiently computes a distance matrix between pairs of unique nodes belonging to the graph, following paths through it.
I would like to know if there is any way to parallelize this process, and/or to compute it lazily instead of holding it all in RAM.
My objective is to provide a precomputed matrix to a DBSCAN model (sklearn.cluster), and it has to be quick so that I can do a kind of grid search through the parameters. I am a beginner with these libraries.
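Not a full answer, but a sketch of one possible route under these constraints: convert the graph to a scipy sparse matrix once, run a multi-source Dijkstra restricted to the nodes that actually have firms attached, and feed the resulting square matrix to DBSCAN with metric='precomputed'. The tiny graph and firm_nodes list below are stand-ins for G_dm and nodes_unique; to_scipy_sparse_array is the newer networkx name (older releases call it to_scipy_sparse_matrix).
import networkx as nx
from scipy.sparse.csgraph import dijkstra
from sklearn.cluster import DBSCAN

# Tiny stand-in for G_dm, with 'length' as the edge weight attribute (as in osmnx graphs).
G = nx.DiGraph()
G.add_weighted_edges_from(
    [(0, 1, 1.0), (1, 2, 2.0), (2, 0, 1.5), (2, 3, 5.0), (3, 2, 5.0)], weight="length"
)

nodes = list(G.nodes())
node_pos = {n: i for i, n in enumerate(nodes)}

# Stand-in for nodes_unique: only the nodes that have firms attached.
firm_nodes = [0, 2, 3]
firm_idx = [node_pos[n] for n in firm_nodes]

# Build the sparse adjacency matrix once; dijkstra then runs in compiled code.
A = nx.to_scipy_sparse_array(G, nodelist=nodes, weight="length")
dist_all = dijkstra(A, directed=True, indices=firm_idx)

# Square matrix of network distances between firm nodes only.
dist_matrix = dist_all[:, firm_idx]

# Precomputed distances straight into DBSCAN; eps and min_samples are free parameters.
labels = DBSCAN(eps=3.0, min_samples=1, metric="precomputed").fit_predict(dist_matrix)
print(labels)
The indices argument of dijkstra is what keeps memory bounded: you only ever hold a block of shape (number of firm nodes) x (number of graph nodes), and that block can itself be computed in chunks of indices if it is still too large.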
Question
I implemented a K-Means algorithm in Python. First I apply PCA and whitening to the input data. Then I use k-means to successfully extract k centroids from the data.
How can I use those centroids to understand the "features" learnt? Are the centroids already the features (doesn't seem like this to me) or do I need to combine them with the input data again?
In response to some answers: k-means is not "just" a method for clustering; it is a vector quantization method. That is, the goal of k-means is to describe a dataset with a reduced number of feature vectors. Therefore there are strong analogies to methods like sparse filtering/learning regarding the potential outcome.
Code Example
from scipy.cluster.vq import vq

# Perform k-means; data already pre-processed (PCA + whitening)
centroids = k_means(matrix_pca_whitened, 1000)

# Assign data to the nearest centroid
idx, _ = vq(song_matrix_pca, centroids)
The clusters produced by the k-means algorithm separate your input space into K regions. When you have new data, you can tell which region it belongs to, and thus classify it.
The centroids are just a property of these clusters.
You can have a look at the scikit-learn doc if you are unsure, and at the map to make sure you choose the right algorithm.
This is sort of a circular question: "understanding" requires knowing something about the features outside of the k-means process. All that k-means does is identify k groups of physical proximity. It says "there are clumps of stuff in these k places, and here's how all the points map to the nearest one."
What this means in terms of the features is up to the data scientist, rather than any deeper meaning that k-means can ascribe. The variance of each group may tell you a little about how tightly those points are clustered. Do remember that k-means also chooses starting points at random; an unfortunate choice can easily give a sub-optimal description of the space.
A centroid is basically the "mean" of the cluster. If you can ascribe some deeper understanding from the distribution of centroids, great -- but that depends on the data and features, rather than any significant meaning devolving from k-means.
Is that the level of answer you need?
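To make the "variance of each group" remark concrete, here is a small self-contained sketch (toy data and scikit-learn's KMeans, not the asker's own implementation) measuring how tightly each cluster hugs its centroid:
import numpy as np
from sklearn.cluster import KMeans

# Toy data: a tight blob and a loose blob in 2D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 1.0, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Per-cluster spread: mean squared distance of the members to their centroid.
for k in range(km.n_clusters):
    members = X[km.labels_ == k]
    spread = np.mean(np.sum((members - km.cluster_centers_[k]) ** 2, axis=1))
    print(f"cluster {k}: {len(members)} points, mean squared distance {spread:.3f}")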
The centroids are in fact the features learnt. Since k-means is a method of vector quantization we look up which observation belongs to which cluster and therefore is best described by the feature vector (centroid).
If an observation was, for example, split into 10 patches beforehand, that observation can consist of at most 10 feature vectors.
Example:
Method: K-means with k=10
Dataset: 20 observations divided into 2 patches each = 40 data vectors
We now perform K-means on this patched dataset and get the nearest centroid per patch. We can then create a vector of length 10 (=k) for each of the 20 observations; if patch 1 belongs to centroid 5 and patch 2 belongs to centroid 9, the vector could look like: 0 - 0 - 0 - 0 - 1 - 0 - 0 - 0 - 1 - 0.
This means that this observation consists of the centroids/features 5 and 9. You could also use the distance between patch and centroid instead of this hard assignment.
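A small sketch of that encoding, with random stand-in patch data and scipy's vector-quantization helpers:
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(0)
k = 10
# 20 observations x 2 patches each = 40 patch vectors of dimension 5 (random stand-ins).
patches = rng.normal(size=(40, 5))

centroids, _ = kmeans(patches, k)
assignments, _ = vq(patches, centroids)  # index of the nearest centroid per patch

# One indicator vector of length k per observation: 1 where one of its patches landed.
encoded = np.zeros((20, k))
for obs in range(20):
    for patch_idx in (2 * obs, 2 * obs + 1):
        encoded[obs, assignments[patch_idx]] = 1
print(encoded[0])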
Given a scale-free graph G (a graph whose degree distribution follows a power law), and the following procedure:
from random import randint

for i in range(C):
    coin = randint(0, 1)
    if coin == 0:
        delete_random_edges(G)
    else:
        add_random_edge(G)
(C is a constant)
So, when C is large, the degree distribution after the procedure would look more like that of G(n,p). I am interested in preserving the power-law distribution, i.e. I want the graph to remain scale-free after this procedure, even for large C.
My idea is to write the procedures delete_random_edges and add_random_edge in a way that gives edges attached to high-degree nodes a small probability of being deleted (and, when adding a new edge, makes it more likely to attach to a high-degree node).
I use NetworkX to represent the graph, and all I have found are procedures that delete or add a specific edge. Any idea how I can implement the above?
Here are 2 algorithms:
Algorithm 1
This algorithm does not preserve the degree exactly, rather it preserves the expected degree.
Save each node's initial degree. Then delete edges at random. Whenever you create an edge, do so by randomly choosing two nodes, each with probability proportional to the initial degree of those nodes.
After a long period of time, the expected degree of each node 'u' is its initial degree (but it might be a bit higher or lower).
Basically, this will create what is called a Chung-Lu random graph. Networkx has a built in algorithm for creating them.
Note - this will allow the degree distribution to vary.
Algorithm 1a
Here is the efficient networkx implementation that skips the repeated edge deleting and adding and goes straight to the final result (assuming a networkx graph G):
# G.degree() returns a view of (node, degree) pairs in networkx 2.x/3.x
degree_list = [d for _, d in G.degree()]
H = nx.expected_degree_graph(degree_list)
Here's the documentation
Algorithm 2
This algorithm preserves the degrees exactly.
Choose a set of edges and break them. Create a list in which each node appears as many times as the number of broken edges it was part of. Shuffle this list. Create new edges between nodes that appear next to each other in this list.
Check to make sure you never join a node to itself or to a node which is already a neighbor. If this would occur you'll want to think of a custom way to avoid it. One option is to simply reshuffle the list. Another is to set those nodes aside and include them in the list you create next time you do this.
edit
There is a built-in networkx command, double_edge_swap, to swap two edges at a time (documentation).
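A minimal sketch of degree-preserving rewiring with that built-in (the graph and the number of swaps are arbitrary choices for illustration):
import networkx as nx

# Start from a scale-free-ish graph; the swaps preserve every node's degree exactly.
G = nx.barabasi_albert_graph(1000, 3, seed=0)
degrees_before = sorted(d for _, d in G.degree())

# Each swap takes two edges (u, v) and (x, y) and rewires them to (u, x) and (v, y).
nx.double_edge_swap(G, nswap=2000, max_tries=20000, seed=0)

degrees_after = sorted(d for _, d in G.degree())
print(degrees_before == degrees_after)  # True: the degree sequence is unchanged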
Although you have already accepted the answer from #abdallah-sobehy, meaning that it works, I would suggest a simpler approach, in case it helps you or anybody around.
What you are trying to do is sometimes called preferential attachment (well, at least when you add nodes), and for that there is a random model developed quite some time ago, the Barabasi-Albert model, which leads to a power-law degree distribution with exponent -3 (P(k) ~ k^-3).
Basically you have to add edges with probability equal to the degree of the node divided by the sum of the degrees of all the nodes. You can use scipy.stats to define the probability distribution with code like this,
import scipy.stats as stats

# assumes integer node labels, which rv_discrete requires for its support
nodes = list(Gx.nodes())
sum_degrees = sum(d for _, d in Gx.degree())
p = [Gx.degree(n) / sum_degrees for n in nodes]
custm = stats.rv_discrete(name='custm', values=(nodes, p))
Then you just pick 2 nodes following that distribution, and those are the 2 nodes you add an edge between,
custm.rvs(size=2)
As for deleting edges, I haven't tried that myself. But I guess you could choose the node to delete an edge from with probability inversely proportional to its degree, with something like this,
sum_inv_degrees = sum(1 / d for _, d in Gx.degree())
p = [1 / (Gx.degree(n) * sum_inv_degrees) for n in Gx]
although honestly I am not completely sure; it is no longer the random model that I link to above...
Hope it helps anyway.
UPDATE after comments
Indeed, by using this method for adding edges to an existing graph, you could get 2 undesired outcomes:
duplicated links
self links
You could remove those, although it will make the results deviate from the expected distribution.
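One simple (if naive) way to avoid both, accepting the deviation just mentioned, is to resample until the pair is valid; a sketch reusing the custm distribution defined above:
# Resample from custm until the two endpoints are distinct and not already connected.
# Note: rejecting pairs this way biases the process away from pure preferential attachment.
def add_preferential_edge(Gx, custm, max_tries=1000):
    for _ in range(max_tries):
        u, v = custm.rvs(size=2)
        if u != v and not Gx.has_edge(u, v):
            Gx.add_edge(u, v)
            return u, v
    raise RuntimeError("could not find a valid pair of nodes")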
Anyhow, you should take into account that you are already deviating from the preferential attachment model, since the algorithm studied by Barabasi-Albert works by adding new nodes and links to the existing graph,
The network begins with an initial connected network of m_0 nodes.
New nodes are added to the network one at a time. Each new node is connected to m <= m_0 existing nodes with a probability that is proportional to the number
...
(see here)
If you want to get an exact distribution (instead of growing an existing network and keeping its properties), you're probably better off with the answer from #joel
Hope it helps.
I am not sure to what extent this will preserve the scale-free property, but this can be a way to implement your idea:
In order to add an edge you need to specify 2 nodes in networkx. So, you can choose one node with a probability proportional to its degree, and the other node uniformly (without any preference). Choosing a highly connected node can be achieved as follows:
For a graph G whose nodes are [0, 1, 2, ..., n]:
1) Create a list of floats (limits) between 0 and 1 that specifies, for each node, a probability of being chosen proportional to its degree. For example: limits[1] - limits[0] is the probability of choosing node 0, limits[2] - limits[1] is the probability of choosing node 1, etc.
# limits is a list of floats between 0 and 1 which defines
# the probability of choosing a certain node depending on its degree
limits = [0.0]
# store the total number of edges of the graph; the sum of all degrees is 2*num_edges
num_edges = G.number_of_edges()
# store the degree of all nodes
degrees = G.degree()
# iterate over the nodes to build the cumulative limits from the degrees
for i in G:
    limits.append(G.degree(i) / (2 * num_edges) + limits[-1])
2) Randomly generate a number between 0 and 1, then compare it to the limits and choose the node to add an edge to accordingly:
import numpy as np

rnd = np.random.random()
# compare the random number to the limits and choose the node accordingly
for j in range(len(limits) - 1):
    if limits[j] <= rnd < limits[j + 1]:
        chosen_node = j
        break
3) Choose another node uniformly, by generating a random integer in [0, n].
4) Add an edge between the two chosen nodes.
5) Similarly, for deleting an edge, you can choose a node with probability proportional to 1/degree instead of degree, then uniformly delete any of its edges (a sketch combining all five steps follows below).
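Putting steps 1-5 together, a rough self-contained sketch might look like this; the toy Barabasi-Albert starting graph, the use of random.choices for the roulette-wheel selection, and the silent skip of invalid pairs are my own choices for illustration:
import random
import networkx as nx

def add_edge_pref(G):
    # One endpoint with probability proportional to its degree, the other uniform (steps 1-4).
    nodes = list(G.nodes())
    degrees = [G.degree(n) for n in nodes]
    u = random.choices(nodes, weights=degrees, k=1)[0]
    v = random.choice(nodes)
    if u != v and not G.has_edge(u, v):
        G.add_edge(u, v)

def delete_edge_pref(G):
    # Node chosen with probability proportional to 1/degree (step 5), then one of its edges.
    nodes = [n for n in G.nodes() if G.degree(n) > 0]
    inv_degrees = [1.0 / G.degree(n) for n in nodes]
    u = random.choices(nodes, weights=inv_degrees, k=1)[0]
    v = random.choice(list(G.neighbors(u)))
    G.remove_edge(u, v)

G = nx.barabasi_albert_graph(500, 3, seed=1)
for _ in range(1000):
    if random.randint(0, 1) == 0:
        delete_edge_pref(G)
    else:
        add_edge_pref(G)
print(G.number_of_edges())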
It would be interesting to know whether this approach preserves the scale-free property and at which C the property is lost, so let us know if it worked or not.
EDIT: As suggested by #joel the selection of the node to add an edge to should be proportional to degree rather than degree^2. I have edited step 1 accordingly.
EDIT2: This might help you judge whether the scale-free graph has lost its property after the edge additions and removals. Simply compute the preferential attachment score before and after the changes. You can find the documentation here.
In NetworkX, how can I cluster nodes based on their color? E.g., I have 100 nodes; some of them are close to black, while others are close to white. In the graph layout, I want nodes with similar colors to stay close to each other, and nodes with very different colors to stay away from each other. How can I do that? Basically, how does the edge weight influence the layout of spring_layout? If NetworkX cannot do that, is there any other tool that can help calculate the layout?
Thanks
Ok, let's build an adjacency matrix W for that graph following this simple procedure:
if both of the adjacent vertices i and j are of the same color, then the weight of the edge between them, W_{i,j}, is a big number (which you will tune in your experiments later); otherwise it is some small number, which you will figure out analogously.
Now, let's write the Laplacian of the matrix as
L = D - W, where D is a diagonal matrix with elements d_{i,i} equal to the sum of the i-th row of W.
Now, one can easily show that the value of
f L f^T, where f is some arbitrary vector, equals (1/2) * sum over i,j of W_{i,j} * (f_i - f_j)^2, so it is small exactly when vertices joined by large weights have close f values. You may think of it as a way to set up a coordinate system for the graph, in which the i-th vertex has coordinate f_i in a 1D space.
Now, let's choose some number of such vectors f^k, which give us a representation of the graph as a set of points in a Euclidean space in which, for example, k-means works: the i-th vertex of the initial graph now has coordinates f^1_i, f^2_i, ..., and vertices of the same color in the initial graph will be close in this new coordinate space.
The question of how to choose the vectors f is a simple one: just take a couple of eigenvectors of the matrix L as f, namely those corresponding to small but nonzero eigenvalues.
This is a well known method called spectral clustering.
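A compact sketch of the whole recipe with plain numpy and scikit-learn (the color threshold and the weights 10.0/0.1 are arbitrary choices to tune, and the random grayscale values stand in for the real node colors):
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 100
colors = rng.random(n)  # one grayscale value per node: 0 = black, 1 = white

# Adjacency weights: big when two nodes have similar colors, small otherwise.
diff = np.abs(colors[:, None] - colors[None, :])
W = np.where(diff < 0.2, 10.0, 0.1)
np.fill_diagonal(W, 0.0)

# Graph Laplacian L = D - W.
L = np.diag(W.sum(axis=1)) - W

# Eigenvectors of the smallest nonzero eigenvalues give the embedding coordinates.
eigvals, eigvecs = np.linalg.eigh(L)
embedding = eigvecs[:, 1:3]  # skip the trivial constant eigenvector

# k-means in the embedded space groups nodes of similar color.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedding)
print(np.round(colors[labels == 0], 2))
As for the layout part of the question: spring_layout reads the 'weight' edge attribute, and heavier edges pull their endpoints closer together, so assigning larger weights to edges between similarly colored nodes is one way to make the layout reflect the colors.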
Further reading:
The Elements of Statistical Learning: Data Mining, Inference, and Prediction. by Trevor Hastie, Robert Tibshirani and Jerome Friedman
which is available for free from the authors page http://www-stat.stanford.edu/~tibs/ElemStatLearn/