Python: Plot H-bond Data Consisting of 0s and 1s

I have molecular dynamics simulation data. The system has 254 solute molecules and almost 12,000 water molecules, and the simulation has almost 4,700 frames. I have extracted the H-bond data: for each frame, if a solute molecule forms an H-bond with any water molecule the entry is 1, otherwise 0. I want to plot this H-bond data, so in total there are 254*4700 data points. The data looks like the example below:
S1 S2 S3 S4 S5 ...
0 0 0 0 0 ...
0 0 0 0 0 ...
0 1 1 0 0 ...
0 0 0 0 0 ...
0 0 1 1 1 ...
0 0 0 0 1 ...
0 1 0 0 1 ...
0 0 0 0 1 ...
...
I want to plot it so that if a data point is 1 it shows a colour, and if it is 0 it shows no colour (just like any other plot, e.g. a scatter plot). Furthermore, I want the two axes of the plot to be
x-axis = solute index (1 ... 254)
y-axis = frame number (1 ... 4700)
so that for each solute on the x-axis, only the frames on the y-axis where the value is 1 are coloured.
Any help would be highly appreciated. Many thanks!

I would suggest plt.imshow for this task:
import matplotlib.pyplot as plt
import numpy as np

solutes = 254
frames = 4700
# dummy 0/1 data with one row per frame and one column per solute
data = np.round(np.random.rand(frames, solutes))

plt.imshow(data, aspect='auto', interpolation='none')
plt.show()
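If you also want the axes to read as solute index versus frame number, as asked, you can add labels and flip the origin so frame 1 sits at the bottom. A minimal sketch continuing from the code above; the extent values and the 'Greys' colormap (which leaves 0s white) are my own choices:
plt.imshow(data, aspect='auto', interpolation='none', origin='lower',
           extent=(0.5, solutes + 0.5, 0.5, frames + 0.5), cmap='Greys')
plt.xlabel('Solute index')
plt.ylabel('Frame number')
plt.show()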


Efficient way to find coordinates of connected blobs in binary image

I am looking for the coordinates of connected blobs in a binary image (2d numpy array of 0 or 1).
The skimage library provides a very fast way to label blobs within the array (which I found from similar SO posts). However, I want a list of the coordinates of each blob, not a labelled array. I have a solution which extracts the coordinates from the labelled image, but it is very slow, far slower than the initial labelling.
Minimal Reproducible example:
import timeit
from skimage import measure
import numpy as np
binary_image = np.array([
[0,1,0,0,1,1,0,1,1,0,0,1],
[0,1,0,1,1,1,0,1,1,1,0,1],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,1,1,1,1,0,0,0,0,1,0,0],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,0,1,0,0,0,0,0,0,0,0,0],
[0,1,0,0,1,1,0,1,1,0,0,1],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,1,1,1,1,0,0,0,0,1,0,0],
])
print(f"\n\n2d array of type: {type(binary_image)}:")
print(binary_image)
labels = measure.label(binary_image)
print(f"\n\n2d array with connected blobs labelled of type {type(labels)}:")
print(labels)
def extract_blobs_from_labelled_array(labelled_array):
    # The goal is to obtain lists of the coordinates
    # of each distinct blob.
    blobs = []
    label = 1
    while True:
        indices_of_label = np.where(labelled_array == label)
        if not indices_of_label[0].size > 0:
            break
        else:
            blob = list(zip(*indices_of_label))
            label += 1
            blobs.append(blob)
    return blobs
if __name__ == "__main__":
    print("\n\nBeginning extract_blobs_from_labelled_array timing\n")
    print("Time taken:")
    print(
        timeit.timeit(
            'extract_blobs_from_labelled_array(labels)',
            globals=globals(),
            number=1
        )
    )
    print("\n\n")
Output:
2d array of type: <class 'numpy.ndarray'>:
[[0 1 0 0 1 1 0 1 1 0 0 1]
[0 1 0 1 1 1 0 1 1 1 0 1]
[0 0 0 0 0 0 0 1 1 1 0 0]
[0 1 1 1 1 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 1 1 1 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0]
[0 1 0 0 1 1 0 1 1 0 0 1]
[0 0 0 0 0 0 0 1 1 1 0 0]
[0 1 1 1 1 0 0 0 0 1 0 0]]
2d array with connected blobs labelled of type <class 'numpy.ndarray'>:
[[ 0 1 0 0 2 2 0 3 3 0 0 4]
[ 0 1 0 2 2 2 0 3 3 3 0 4]
[ 0 0 0 0 0 0 0 3 3 3 0 0]
[ 0 5 5 5 5 0 0 0 0 3 0 0]
[ 0 0 0 0 0 0 0 3 3 3 0 0]
[ 0 0 6 0 0 0 0 0 0 0 0 0]
[ 0 6 0 0 7 7 0 8 8 0 0 9]
[ 0 0 0 0 0 0 0 8 8 8 0 0]
[ 0 10 10 10 10 0 0 0 0 8 0 0]]
Beginning extract_blobs_from_labelled_array timing
Time taken:
9.346099977847189e-05
9e-05 is small, but so is this example image. In reality I am working with very high resolution images, for which the function takes approximately 10 minutes.
Is there a faster way to do this?
Side note: I'm only using list(zip()) to try to get the numpy coordinates into something I'm used to (I don't use numpy much, just Python). Should I be skipping this and just using the coordinates to index as-is? Will that speed it up?
The part of the code that is slow is here:
while True:
    indices_of_label = np.where(labelled_array == label)
    if not indices_of_label[0].size > 0:
        break
    else:
        blob = list(zip(*indices_of_label))
        label += 1
        blobs.append(blob)
First, a complete aside: you should avoid using while True when you know the number of elements you will be iterating over. It's a recipe for hard-to-find infinite-loop bugs.
Instead, you should use:
for label in range(1, np.max(labels) + 1):
and then you can ignore the if ...: break.
A second issue is indeed that you are using list(zip(*)), which is slow compared to NumPy functions. Here you could get approximately the same result with np.transpose(indices_of_label), which will get you a 2D array of shape (n_coords, n_dim), i.e. (n_coords, 2).
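A minimal sketch of the loop with just those two changes applied (extract_blobs_loop is my name for this variant); it still makes one full pass over the image per label, so it stays slow on large images:
def extract_blobs_loop(labelled_array):
    blobs = []
    for label in range(1, np.max(labelled_array) + 1):
        indices_of_label = np.where(labelled_array == label)
        # an (n_coords, 2) array instead of a list of coordinate tuples
        blobs.append(np.transpose(indices_of_label))
    return blobs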
But the Big Issue is the expression labelled_array == label. This will examine every pixel of the image once for every label. (Twice, actually, because then you run np.where(), which takes another pass.) This is a lot of unnecessary work, as the coordinates can be found in one pass.
The scikit-image function skimage.measure.regionprops can do this for you. regionprops goes over the image once and returns a list containing one RegionProps object per label. The object has a .coords attribute containing the coordinates of each pixel in the blob. So, here's your code, modified to use that function:
import timeit
from skimage import measure
import numpy as np
binary_image = np.array([
[0,1,0,0,1,1,0,1,1,0,0,1],
[0,1,0,1,1,1,0,1,1,1,0,1],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,1,1,1,1,0,0,0,0,1,0,0],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,0,1,0,0,0,0,0,0,0,0,0],
[0,1,0,0,1,1,0,1,1,0,0,1],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,1,1,1,1,0,0,0,0,1,0,0],
])
print(f"\n\n2d array of type: {type(binary_image)}:")
print(binary_image)
labels = measure.label(binary_image)
print(f"\n\n2d array with connected blobs labelled of type {type(labels)}:")
print(labels)
def extract_blobs_from_labelled_array(labelled_array):
    """Return a list containing coordinates of pixels in each blob."""
    props = measure.regionprops(labelled_array)
    blobs = [p.coords for p in props]
    return blobs

if __name__ == "__main__":
    print("\n\nBeginning extract_blobs_from_labelled_array timing\n")
    print("Time taken:")
    print(
        timeit.timeit(
            'extract_blobs_from_labelled_array(labels)',
            globals=globals(),
            number=1
        )
    )
    print("\n\n")

Numpy Interpolation Between Points Within Array (scipy.griddata)

I have a numpy array of a fixed size holding irregularly spaced data. An example would be:
[1 0 0 0 3 0 0 0 2 0
0 1 0 0 0 0 0 0 2 0
0 1 0 0 1 0 6 0 9 0
0 0 0 0 6 0 3 0 0 1]
I want to keep the array the same shape, but have all the 0 values overwritten with data interpolated from the points that do have data. If the data points in the array are thought of as height values, this would essentially be creating a surface over the points.
I have been trying to use scipy.interpolate.griddata but am continually getting errors. I start with an array of my known data points, as [x, y, value]. For the above, (first row only for brevity)
data = [0, 0, 1
0, 3, 3
0, 8, 2 ....................
I then define
points = (data[:,0], data[:,1])
values = (data[:,2])
Next, I define the points to sample at (in this case, the grid I desire)
grid = np.indices((4,10))
Finally, call griddata
t = interpolate.griddata(points, values, grid, method = 'linear')
This returns the following error
ValueError: number of dimensions in xi does not match x
Am I using the wrong function?
Thanks!
Solved: You need to pass the desired points as a tuple
t = interpolate.griddata(points, values, (grid[0,:,:], grid[1,:,:]), method = 'linear')
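For completeness, here is a minimal end-to-end sketch of that fix using the 4x10 example array from the question, treating the nonzero cells as the known data points (variable names are my own):
import numpy as np
from scipy import interpolate

A = np.array([[1,0,0,0,3,0,0,0,2,0],
              [0,1,0,0,0,0,0,0,2,0],
              [0,1,0,0,1,0,6,0,9,0],
              [0,0,0,0,6,0,3,0,0,1]])

rows, cols = np.nonzero(A)   # coordinates of the known data points
values = A[rows, cols]       # their values
grid = np.indices(A.shape)   # every (row, col) of the 4x10 grid

# pass both the data points and the sample grid as tuples of coordinate arrays
t = interpolate.griddata((rows, cols), values, (grid[0], grid[1]), method='linear')
print(t)  # cells outside the convex hull of the data points come back as NaN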

Spectral Clustering a graph in python

I'd like to cluster a graph in python using spectral clustering.
Spectral clustering is a more general technique that can be applied not only to graphs but also to images or any sort of data; however, it's considered an exceptional graph clustering technique. Sadly, I can't find examples of spectral clustering of graphs in Python online.
Scikit-learn has two spectral clustering methods documented, SpectralClustering and spectral_clustering, which do not seem to be aliases.
Both of those methods mention that they could be used on graphs, but do not offer specific instructions. Neither does the user guide. I've asked for such an example from the developers, but they're overworked and haven't gotten to it.
A good network to document this against is the Karate Club Network. It's included as a method in networkx.
I'd love some direction in how to go about this. If someone can help me figure it out, I can add the documentation to scikit learn.
Notes:
A question much like this one has already been asked on this site.
Without much experience with Spectral-clustering and just going by the docs (skip to the end for the results!):
Code:
import numpy as np
import networkx as nx
from sklearn.cluster import SpectralClustering
from sklearn import metrics
np.random.seed(1)
# Get your mentioned graph
G = nx.karate_club_graph()
# Get ground-truth: club-labels -> transform to 0/1 np-array
# (possible overcomplicated networkx usage here)
gt_dict = nx.get_node_attributes(G, 'club')
gt = [gt_dict[i] for i in G.nodes()]
gt = np.array([0 if i == 'Mr. Hi' else 1 for i in gt])
# Get adjacency-matrix as numpy-array
adj_mat = nx.to_numpy_matrix(G)
print('ground truth')
print(gt)
# Cluster
sc = SpectralClustering(2, affinity='precomputed', n_init=100)
sc.fit(adj_mat)
# Compare ground-truth and clustering-results
print('spectral clustering')
print(sc.labels_)
print('just for better-visualization: invert clusters (permutation)')
print(np.abs(sc.labels_ - 1))
# Calculate some clustering metrics
print(metrics.adjusted_rand_score(gt, sc.labels_))
print(metrics.adjusted_mutual_info_score(gt, sc.labels_))
Output:
ground truth
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
spectral clustering
[1 1 0 1 1 1 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
just for better-visualization: invert clusters (permutation)
[0 0 1 0 0 0 0 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
0.204094758281
0.271689477828
The general idea:
Introduction on the data and task from here:
The nodes in the graph represent the 34 members in a college Karate club. (Zachary is a sociologist, and he was one of the members.) An edge between two nodes indicates that the two members spent significant time together outside normal club meetings. The dataset is interesting because while Zachary was collecting his data, there was a dispute in the Karate club, and it split into two factions: one led by “Mr. Hi”, and one led by “John A”. It turns out that using only the connectivity information (the edges), it is possible to recover the two factions.
Using sklearn & spectral-clustering to tackle this:
If affinity is the adjacency matrix of a graph, this method can be used to find normalized graph cuts.
This describes normalized graph cuts as:
Find two disjoint partitions A and B of the vertices V of a graph, so
that A ∪ B = V and A ∩ B = ∅
Given a similarity measure w(i,j) between two vertices (e.g. identity
when they are connected) a cut value (and its normalized version) is defined as:
cut(A, B) = SUM u in A, v in B: w(u, v)
...
we seek the minimization of disassociation
between the groups A and B and the maximization of the association
within each group
Sounds alright. So we create the adjacency matrix (nx.to_numpy_matrix(G)) and set the param affinity to precomputed (as our adjacency matrix is our precomputed similarity measure).
Alternatively, using precomputed, a user-provided affinity matrix can be used.
Edit: While unfamiliar with this, I looked for parameters to tune and found assign_labels:
The strategy to use to assign labels in the embedding space. There are two ways to assign labels after the laplacian embedding. k-means can be applied and is a popular choice. But it can also be sensitive to initialization. Discretization is another approach which is less sensitive to random initialization.
So trying the less sensitive approach:
sc = SpectralClustering(2, affinity='precomputed', n_init=100, assign_labels='discretize')
Output:
ground truth
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
spectral clustering
[0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
just for better-visualization: invert clusters (permutation)
[1 1 0 1 1 1 1 1 0 0 1 1 1 1 0 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
0.771725032425
0.722546051351
That's a pretty much perfect fit to the ground-truth!
Here is a dummy example just to see what it does to a simple similarity matrix -- inspired by sascha's answer.
Code
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn import metrics
np.random.seed(0)
adj_mat = [[3,2,2,0,0,0,0,0,0],
[2,3,2,0,0,0,0,0,0],
[2,2,3,1,0,0,0,0,0],
[0,0,1,3,3,3,0,0,0],
[0,0,0,3,3,3,0,0,0],
[0,0,0,3,3,3,1,0,0],
[0,0,0,0,0,1,3,1,1],
[0,0,0,0,0,0,1,3,1],
[0,0,0,0,0,0,1,1,3]]
adj_mat = np.array(adj_mat)
sc = SpectralClustering(3, affinity='precomputed', n_init=100)
sc.fit(adj_mat)
print('spectral clustering')
print(sc.labels_)
Output
spectral clustering
[0 0 0 1 1 1 2 2 2]
Let's first cluster a graph G into K=2 clusters and then generalize for all K.
We can use the function linalg.algebraicconnectivity.fiedler_vector() from networkx to compute the Fiedler vector (the eigenvector corresponding to the second smallest eigenvalue of the graph Laplacian matrix) of the graph, with the assumption that the graph is a connected undirected graph.
Then we can threshold the values of the eigenvector to compute the cluster index each node corresponds to, as shown in the next code block:
import networkx as nx
import numpy as np
A = np.zeros((11,11))
A[0,1] = A[0,2] = A[0,3] = A[0,4] = 1
A[5,6] = A[5,7] = A[5,8] = A[5,9] = A[5,10] = 1
A[0,5] = 5
G = nx.from_numpy_matrix(A)
ev = nx.linalg.algebraicconnectivity.fiedler_vector(G)
labels = [0 if v < 0 else 1 for v in ev] # using threshold 0
labels
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
nx.draw(G, pos=nx.drawing.layout.spring_layout(G),
with_labels=True, node_color=labels)
We can obtain the same clustering by eigen-analysis of the graph Laplacian, again choosing the eigenvector corresponding to the 2nd smallest eigenvalue:
L = nx.laplacian_matrix(G)
e, v = np.linalg.eig(L.todense())
idx = np.argsort(e)
e = e[idx]
v = v[:,idx]
labels = [0 if x < 0 else 1 for x in v[:,1]] # using threshold 0
labels
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
Drawing the graph again with the clusters labelled:
With SpectralClustering from sklearn.cluster we can get the exact same result:
sc = SpectralClustering(2, affinity='precomputed', n_init=100)
sc.fit(A)
sc.labels_
# [0 0 0 0 0 1 1 1 1 1 1]
We can generalize the above to K > 2 clusters by partitioning the Fiedler vector with k-means clustering instead of thresholding it. The following code demonstrates this for a 3-clustering of a graph defined by the following adjacency matrix:
A = np.array([[3,2,2,0,0,0,0,0,0],
[2,3,2,0,0,0,0,0,0],
[2,2,3,1,0,0,0,0,0],
[0,0,1,3,3,3,0,0,0],
[0,0,0,3,3,3,0,0,0],
[0,0,0,3,3,3,1,0,0],
[0,0,0,0,0,1,3,1,1],
[0,0,0,0,0,0,1,3,1],
[0,0,0,0,0,0,1,1,3]])
K = 3 # K clusters
G = nx.from_numpy_matrix(A)
ev = nx.linalg.algebraicconnectivity.fiedler_vector(G)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=K, random_state=0).fit(ev.reshape(-1,1))
kmeans.labels_
# array([2, 2, 2, 0, 0, 0, 1, 1, 1])
Now draw the clustered graph, labelling the nodes with the clusters obtained above:
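The drawing call follows the same pattern as in the K = 2 case above, with kmeans.labels_ supplying the node colours:
nx.draw(G, pos=nx.drawing.layout.spring_layout(G),
        with_labels=True, node_color=kmeans.labels_)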

How to represent boolean data in graph

How can I represent the data below in a comprehensible graph? I tried groupby() from Pandas but the result is not easy to read.
My objective is to show which of the combinations below causes the most accidents.
pieton bicyclette camion_lourd vehicule
0 0 1 1
0 1 0 1
1 1 0 0
0 1 1 0
0 1 0 1
1 0 0 1
0 0 0 1
0 0 0 1
1 1 0 0
0 1 0 1
y = df.groupby(['pieton', 'bicyclette', 'camion_lourd', 'vehicule']).size()
y.unstack()
result:
Here are some visualizations that may help you:
#data analysis and wrangling
import pandas as pd
import numpy as np
# visualization
import matplotlib.pyplot as plt
columns = ['pieton', 'bicyclette', 'camion_lourd', 'vehicule']
df = pd.DataFrame([[0,0,1,1],[0,1,0,1],
                   [1,1,0,0],[0,1,1,0],
                   [0,1,0,1],[1,0,0,1],
                   [0,0,0,1],[0,0,0,1],
                   [1,1,0,0],[0,1,0,1]], columns = columns)
You can start by seeing the proportion of accident per category:
# Set up a grid of plots
fig = plt.figure(figsize=(10,10))
fig_dims = (3, 2)
# Plot accidents depending on type
plt.subplot2grid(fig_dims, (0, 0))
df['pieton'].value_counts().plot(kind='bar',
title='Pieton')
plt.subplot2grid(fig_dims, (0, 1))
df['bicyclette'].value_counts().plot(kind='bar',
title='bicyclette')
plt.subplot2grid(fig_dims, (1, 0))
df['camion_lourd'].value_counts().plot(kind='bar',
title='camion_lourd')
plt.subplot2grid(fig_dims, (1, 1))
df['vehicule'].value_counts().plot(kind='bar',
title='vehicule')
Which gives:
Or if you prefer:
df.apply(pd.value_counts).plot(kind='bar',
title='all types')
But, more interestingly, I would do a comparison per pair. For example, for pedestrians:
pieton = {}
for col in columns:
    pieton[col] = np.sum(df.pieton[df[col] == 1])
pieton.pop('pieton', None)
plt.bar(range(len(pieton)), pieton.values(), align='center')
plt.xticks(range(len(pieton)), pieton.keys())
plt.title("Who got an accident with a pedestrian?")
plt.legend(loc='best')
plt.show()
Which gives:
The similar plot can be done for bicycles, trucks and cars, giving:
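A sketch that produces all four of those pairwise bar charts in one loop, using the same df and columns as above (the loop and subplot layout are my own arrangement):
fig, axes = plt.subplots(2, 2, figsize=(10, 10))
for ax, col in zip(axes.ravel(), columns):
    # for each other category, count how often it appears together with `col`
    counts = {other: int(np.sum(df[col][df[other] == 1]))
              for other in columns if other != col}
    ax.bar(range(len(counts)), list(counts.values()), align='center')
    ax.set_xticks(range(len(counts)))
    ax.set_xticklabels(list(counts.keys()))
    ax.set_title("Accidents involving {}".format(col))
plt.tight_layout()
plt.show()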
It would be interesting to have more data points, to be able to draw better conclusions. However, this still tells us to watch out for bicycles if you are driving!
Hope this helped!

Is there any easy way to rotate the values of a matrix/array?

So, let's say I have the following matrix/array -
[0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 1 0 0 0 0 0
0 0 0 1 1 1 1 0 0 0 0 0
0 0 1 1 1 1 1 1 0 0 0 0
0 0 1 1 1 1 1 1 0 0 0 0
0 0 0 1 1 1 1 0 0 0 0 0
0 0 0 0 1 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0]
It would be fairly trivial to write something that would translate these values up and down. What if I wanted to rotate it by an angle that isn't a multiple of 90 degrees? I know that it is obviously impossible to get exactly the same shape (made of 1s), because of the nature of the grid. The idea that comes to mind is converting each value of 1 to a coordinate vector. Then it would amount to rotating the coordinates (which should be simpler) about a point. One could then write something which takes the coordinates, compares them to the matrix grid, and fills a box if a rotated point lands in it. I know I'll also have to find a center around which to rotate.
Does this seem like a reasonable way to do this? If anyone has a better idea, I'm all ears. I know with a small grid like this, the shape would probably be entirely different, however if I had a large shape represented by 1s, in a large grid, the difference between representations would be smaller.
First of all, rotating a shape like that, made of only 1s and 0s, by a non-90-degree angle is not really going to look much like the original at all when it's done at such a low "resolution". However, I would recommend looking into rotation matrices. Like you said, you would probably want to find each value as a coordinate pair and rotate it around the center. It would probably be easier if you made this a two-dimensional array. Good luck!
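For example, here is a minimal sketch of that rotation-matrix idea (rotate_binary is a hypothetical helper name): rotate the coordinates of the 1s about the shape's centroid and drop any points that land outside the grid:
import numpy as np

def rotate_binary(A, angle_deg):
    theta = np.radians(angle_deg)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])  # 2x2 rotation matrix
    coords = np.argwhere(A == 1)                     # (n, 2) array of (row, col) pairs
    center = coords.mean(axis=0)                     # rotate about the shape's centroid
    new_coords = np.rint((coords - center) @ R.T + center).astype(int)
    B = np.zeros_like(A)
    inside = ((new_coords >= 0) & (new_coords < np.array(A.shape))).all(axis=1)
    B[new_coords[inside, 0], new_coords[inside, 1]] = 1
    return B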
I think this should work:
from math import sin, cos, atan2, radians
import numpy as np

i0, j0 = 0, 0          # point around which you'll rotate
alpha = radians(3)     # 3 degrees
B = np.zeros(A.shape)  # A is your original 0/1 array
for i, j in np.swapaxes(np.where(A == 1), 0, 1):
    di = i - i0
    dj = j - j0
    dist = (di**2 + dj**2)**0.5
    ang = atan2(dj, di)
    # rotate the offset by alpha; cos goes with the row offset, sin with the column offset
    pi = round(cos(ang + alpha) * dist) + i0
    pj = round(sin(ang + alpha) * dist) + j0
    B[pi][pj] = 1
But please don't forget about indices going out of bounds! The B array should be much bigger than A, and the origin should (optimally) be in the middle of the array, so that rotated points don't fall outside the grid or wrap around via negative indices.
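If a ready-made routine is acceptable, scipy.ndimage.rotate does this kind of grid rotation directly. A minimal sketch using the example array from the question; order=0 keeps the values 0/1 and reshape=False keeps the original shape (corners that rotate out of the frame are simply lost):
import numpy as np
from scipy import ndimage

A = np.array([[0,0,0,0,0,0,0,0,0,0,0,0],
              [0,0,0,0,1,1,1,0,0,0,0,0],
              [0,0,0,1,1,1,1,0,0,0,0,0],
              [0,0,1,1,1,1,1,1,0,0,0,0],
              [0,0,1,1,1,1,1,1,0,0,0,0],
              [0,0,0,1,1,1,1,0,0,0,0,0],
              [0,0,0,0,1,1,0,0,0,0,0,0],
              [0,0,0,0,0,0,0,0,0,0,0,0]])

B = ndimage.rotate(A, angle=30, reshape=False, order=0)  # rotate 30 degrees about the array centre
print(B)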
