How to find similar predicted x between 2 models? - python

I have 2 models implemented with the same algorithm but with a different number of features, and therefore 2 different confusion matrices.
I would like to see which predicted items are the same between those 2 models and plot the overlap in a Venn diagram.

Answer
import numpy as np
import pandas as pd
data = {"Mod1":[1,0,1,1,0,0,0,1,1,1],"Mod2":[1,0,1,0,1,0,0,1,0,1]}
df = pd.DataFrame(data)
# 1 where both models predict the same value, 0 otherwise
df["Similar"] = np.where(df["Mod1"]==df["Mod2"],1,0)
df.head()
#output
   Mod1  Mod2  Similar
0     1     1        1
1     0     0        1
2     1     1        1
3     1     0        0
4     0     1        0
This should do the job
Visualization
# !pip install matplotlib-venn
import matplotlib.pyplot as plt
from matplotlib_venn import venn2
venn2(subsets = (3, 3, 7), set_labels = ('Mod1', 'Mod2'))
plt.show()
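The counts passed to venn2 above are hard-coded. As a minimal sketch, they can instead be derived from the predictions, assuming each model's positive predictions (value 1) are treated as a set of row indices. The names mod1_pos, mod2_pos and subsets are illustrative, not part of the original answer.

```python
import pandas as pd

data = {"Mod1": [1, 0, 1, 1, 0, 0, 0, 1, 1, 1],
        "Mod2": [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]}
df = pd.DataFrame(data)

# Row indices that each model predicts as positive (assumption: 1 = positive)
mod1_pos = set(df.index[df["Mod1"] == 1])
mod2_pos = set(df.index[df["Mod2"] == 1])

only_mod1 = len(mod1_pos - mod2_pos)   # predicted positive by Mod1 only
only_mod2 = len(mod2_pos - mod1_pos)   # predicted positive by Mod2 only
both = len(mod1_pos & mod2_pos)        # predicted positive by both models

# Order expected by venn2: (left only, right only, intersection)
subsets = (only_mod1, only_mod2, both)
print(subsets)
```

The resulting tuple can be passed as venn2(subsets=subsets, set_labels=('Mod1', 'Mod2')).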

Related

How to set individual bar plot's color in matplotlib?

I am trying to change the color of each individual bar in my figure. The code that I used is down below. Instead of each bar changing to the color that I have set in c, there are several colors within each bar. I have included a screenshot of this. How can I fix this? Thank you all in advance!
Clusters is just a categorical variable of 5 groups, ranging from 0 to 4. I have included a second screenshot of the dataframe.
So essentially, what I am trying to do is to plot each cluster for economic ideology and social ideology so I can have a visual comparison of the 5 different clusters over these two dimensions (economic and social ideology). Each cluster should be represented by one color. For example, cluster 0 should be red in color.
c = ['#bf1111', '#1c4975', '#278f36', '#47167a', '#de8314']
plt.subplot(1, 2, 1)
plt.bar(data = ANESdf_LatNEW, height = "EconIdeo",
x = "clusters", color = c)
plt.title('Economic Ideology')
plt.xticks([0, 1, 2, 3, 4])
plt.xlabel('Clusters')
plt.ylabel('')
plt.subplot(1, 2, 2)
plt.bar(data = ANESdf_LatNEW, height = "SocialIdeo",
x = "clusters", color = c)
plt.title('Social Ideology')
plt.xticks([0, 1, 2, 3, 4])
plt.xlabel('Clusters')
plt.ylabel('')
plt.show()
Bar graph here
Top 5 rows of dataframe
I have tried multiple ways of changing colors. For example, instead of having c, I had put in the colors directly at color = ... This did not work either.
Here is a script that does what you seem to be looking for based on your edits and comment.
Note that I do not assume that all clusters have the same size in this context; if that is the case, this approach can be simplified.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# sample dataframe
df = pd.DataFrame(
    {
        'EconIdeo': [1,2,3,4,3,5,7],
        'Clusters': [2,3,0,1,3,0,3]
    })
print(df)
# parameters: width for each cluster, colors for each cluster
# (if clusters are not sequential from zero, replace c with dictionary)
width = .75
c = ['#bf1111', '#1c4975', '#278f36', '#47167a', '#de8314']
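# float positions, so that fractional offsets can be added below
df['xpos'] = df['Clusters'].astype(float)
df['width'] = width
df['color'] = ''
clusters = df['Clusters'].unique()
for k in clusters:
    where = (df['Clusters'] == k)  # rows belonging to cluster k
    n = where.sum()
    # centers of n equal sub-slots inside the cluster's slot
    df.loc[where,'xpos'] += np.linspace(-width/2,width/2,2*n+1)[1:-1:2]
    df.loc[where,'width'] /= n
    df.loc[where,'color'] = c[k]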
plt.bar(data = df, height = "EconIdeo", x = 'xpos',
width = 'width', color = 'color')
plt.xticks(clusters,clusters)
plt.show()
Resulting plot:
Input dataframe:
   EconIdeo  Clusters
0         1         2
1         2         3
2         3         0
3         4         1
4         3         3
5         5         0
6         7         3
Dataframe after script applies changes (to include plotting specifications)
   EconIdeo  Clusters    xpos  width    color
0         1         2  2.0000  0.750  #278f36
1         2         3  2.7500  0.250  #47167a
2         3         0 -0.1875  0.375  #bf1111
3         4         1  1.0000  0.750  #1c4975
4         3         3  3.0000  0.250  #47167a
5         5         0  0.1875  0.375  #bf1111
6         7         3  3.2500  0.250  #47167a
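The xpos offsets in the table come from the linspace expression in the loop: the slot of total width `width` is split into n equal sub-slots and the centers of those sub-slots are kept. A small sketch of just that step, where the helper name bar_offsets is hypothetical:

```python
import numpy as np

def bar_offsets(width, n):
    """Centers of n equal sub-slots inside a slot of the given total width.

    linspace yields 2*n + 1 points (edges and centers alternating);
    the slice [1:-1:2] keeps only the centers.
    """
    return np.linspace(-width / 2, width / 2, 2 * n + 1)[1:-1:2]

three = bar_offsets(0.75, 3)   # offsets for a cluster with 3 rows
two = bar_offsets(0.75, 2)     # offsets for a cluster with 2 rows
one = bar_offsets(0.75, 1)     # a single row stays centered
print(three, two, one)
```

Adding these offsets to the cluster number gives the xpos values shown above, for example 3 - 0.25, 3 and 3 + 0.25 for the three rows of cluster 3.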

Python: Plot Data with 0s 1s H-bond Data

I have data from a molecular dynamics simulation. The system has 254 solute molecules and almost 12000 water molecules, and the simulation has almost 4700 frames. I have extracted the H-bond data: for each frame, a solute molecule gets a 1 if it forms an H-bond with any water molecule and a 0 otherwise. I want to plot this H-bond data, so in total there are 254*4700 data points. The data looks like the following example:
S1 S2 S3 S4 S5 ...
0 0 0 0 0 ...
0 0 0 0 0 ...
0 1 1 0 0 ...
0 0 0 0 0 ...
0 0 1 1 1 ...
0 0 0 0 1 ...
0 1 0 0 1 ...
0 0 0 0 1 ...
...
I want the plot to show a color where the data point is 1 and no color where it is 0 (as in a scatter plot). I also want two axes on the plot such that
x-axis=Number of solutes (1 ... 254)
y-axis=number of frames (1 ... 4700)
So for each solute on the x-axis, only the frames on the y-axis where the value is 1 should be colored.
Any help would be highly appreciated. Many thanks!
I would suggest plt.imshow for this task:
import matplotlib.pyplot as plt
import numpy as np
solutes = 254
frames = 4700
data = np.round(np.random.rand(frames, solutes))
plt.imshow(data, aspect='auto', interpolation='none')
plt.show()
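To leave the 0 entries uncolored and label the axes as requested in the question, one option is to mask the zeros before calling imshow. This is only a sketch under assumptions: the data is random stand-in data with smaller sizes than in the question, and the variable names (hbond, masked) are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
frames, solutes = 200, 20                      # small stand-in sizes
hbond = rng.integers(0, 2, size=(frames, solutes))

# Masked entries are not drawn, so zeros stay blank
masked = np.ma.masked_equal(hbond, 0)

fig, ax = plt.subplots()
image = ax.imshow(
    masked,
    aspect='auto',
    interpolation='none',
    origin='lower',                            # frame 1 at the bottom
    extent=(0.5, solutes + 0.5, 0.5, frames + 0.5),  # 1-based tick values
)
ax.set_xlabel('Solute index')
ax.set_ylabel('Frame')
```

Calling plt.show() afterwards displays the figure. Rows of the array map to frames on the y-axis and columns map to solutes on the x-axis.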

Spectral Clustering a graph in python

I'd like to cluster a graph in python using spectral clustering.
Spectral clustering is a general technique that can be applied not only to graphs but also to images or any other sort of data. However, it's considered an exceptional graph clustering technique. Sadly, I can't find examples of spectral clustering of graphs in python online.
Scikit Learn has two spectral clustering methods documented: SpectralClustering and spectral_clustering which seem like they're not aliases.
Both of those methods mention that they could be used on graphs, but do not offer specific instructions. Neither does the user guide. I've asked for such an example from the developers, but they're overworked and haven't gotten to it.
A good network to document this against is the Karate Club Network. It's included as a method in networkx.
I'd love some direction in how to go about this. If someone can help me figure it out, I can add the documentation to scikit learn.
Notes:
A question much like this one has already been asked on this site.
Without much experience with Spectral-clustering and just going by the docs (skip to the end for the results!):
Code:
import numpy as np
import networkx as nx
from sklearn.cluster import SpectralClustering
from sklearn import metrics
np.random.seed(1)
# Get your mentioned graph
G = nx.karate_club_graph()
# Get ground-truth: club-labels -> transform to 0/1 np-array
# (possible overcomplicated networkx usage here)
gt_dict = nx.get_node_attributes(G, 'club')
gt = [gt_dict[i] for i in G.nodes()]
gt = np.array([0 if i == 'Mr. Hi' else 1 for i in gt])
# Get adjacency-matrix as numpy-array
# (nx.to_numpy_matrix was removed in networkx 3.0; to_numpy_array replaces it)
adj_mat = nx.to_numpy_array(G)
print('ground truth')
print(gt)
# Cluster
sc = SpectralClustering(2, affinity='precomputed', n_init=100)
sc.fit(adj_mat)
# Compare ground-truth and clustering-results
print('spectral clustering')
print(sc.labels_)
print('just for better-visualization: invert clusters (permutation)')
print(np.abs(sc.labels_ - 1))
# Calculate some clustering metrics
print(metrics.adjusted_rand_score(gt, sc.labels_))
print(metrics.adjusted_mutual_info_score(gt, sc.labels_))
Output:
ground truth
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
spectral clustering
[1 1 0 1 1 1 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
just for better-visualization: invert clusters (permutation)
[0 0 1 0 0 0 0 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
0.204094758281
0.271689477828
The general idea:
Introduction on the data and task from here:
The nodes in the graph represent the 34 members in a college Karate club. (Zachary is a sociologist, and he was one of the members.) An edge between two nodes indicates that the two members spent significant time together outside normal club meetings. The dataset is interesting because while Zachary was collecting his data, there was a dispute in the Karate club, and it split into two factions: one led by “Mr. Hi”, and one led by “John A”. It turns out that using only the connectivity information (the edges), it is possible to recover the two factions.
Using sklearn & spectral-clustering to tackle this:
If affinity is the adjacency matrix of a graph, this method can be used to find normalized graph cuts.
This describes normalized graph cuts as:
Find two disjoint partitions A and B of the vertices V of a graph, so
that A ∪ B = V and A ∩ B = ∅
Given a similarity measure w(i,j) between two vertices (e.g. identity
when they are connected) a cut value (and its normalized version) is defined as:
cut(A, B) = SUM u in A, v in B: w(u, v)
...
we seek the minimization of disassociation
between the groups A and B and the maximization of the association
within each group
Sounds alright. So we create the adjacency matrix (nx.to_numpy_array(G) in current networkx; older versions used nx.to_numpy_matrix(G)) and set the param affinity to precomputed (as our adjacency matrix is our precomputed similarity measure).
Alternatively, using precomputed, a user-provided affinity matrix can be used.
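The cut value quoted above can be computed directly from an adjacency matrix. The following is a sketch of the unnormalized cut only, on a made-up toy graph (two triangles joined by one edge); the function name cut_value is hypothetical and this is not how sklearn evaluates it internally.

```python
import numpy as np

def cut_value(adj, labels):
    """Sum of edge weights w(u, v) with u in group 0 and v in group 1."""
    adj = np.asarray(adj, dtype=float)
    labels = np.asarray(labels)
    in_a = labels == 0
    in_b = ~in_a
    return adj[np.ix_(in_a, in_b)].sum()

# Toy graph: triangle 0-1-2, triangle 3-4-5, bridge edge 2-3
adj = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[u, v] = adj[v, u] = 1

natural_split = [0, 0, 0, 1, 1, 1]   # cuts only the bridge
shifted_split = [0, 0, 1, 1, 1, 1]   # cuts edges 0-2 and 1-2

cut_natural = cut_value(adj, natural_split)
cut_shifted = cut_value(adj, shifted_split)
print(cut_natural, cut_shifted)
```

The split along the bridge has the smaller cut, which is the kind of partition the graph-cut objective favors.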
Edit: While unfamiliar with this, I looked for parameters to tune and found assign_labels:
The strategy to use to assign labels in the embedding space. There are two ways to assign labels after the laplacian embedding. k-means can be applied and is a popular choice. But it can also be sensitive to initialization. Discretization is another approach which is less sensitive to random initialization.
So trying the less sensitive approach:
sc = SpectralClustering(2, affinity='precomputed', n_init=100, assign_labels='discretize')
Output:
ground truth
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
spectral clustering
[0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
just for better-visualization: invert clusters (permutation)
[1 1 0 1 1 1 1 1 0 0 1 1 1 1 0 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
0.771725032425
0.722546051351
That's a close fit to the ground truth: only 2 of the 34 nodes (indices 2 and 8) are assigned to a different cluster.
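Because cluster ids are arbitrary, a clustering and its inverted version describe the same partition. A small sketch of counting disagreements for two clusters regardless of which id each cluster received, using the ground truth and the discretize output printed above (the variable names are illustrative):

```python
import numpy as np

gt = np.array([0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,1,0,
               1,0,1,1,1,1,1,1,1,1,1,1,1,1])
pred = np.array([0,0,1,0,0,0,0,0,1,1,0,0,0,0,1,1,0,0,1,0,
                 1,0,1,1,1,1,1,1,1,1,1,1,1,1])

# Compare against both possible namings of the two clusters
mismatch_as_is = int((gt != pred).sum())
mismatch_flipped = int((gt != 1 - pred).sum())
n_mismatch = min(mismatch_as_is, mismatch_flipped)

# Nodes that end up in a different group than in the ground truth
differing_nodes = np.flatnonzero(gt != pred)
print(n_mismatch, differing_nodes)
```

Scores such as adjusted_rand_score already ignore the naming of the clusters, so this kind of check is only needed when comparing label arrays by eye.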
Here is a dummy example just to see what it does to a simple similarity matrix -- inspired by sascha's answer.
Code
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn import metrics
np.random.seed(0)
adj_mat = [[3,2,2,0,0,0,0,0,0],
[2,3,2,0,0,0,0,0,0],
[2,2,3,1,0,0,0,0,0],
[0,0,1,3,3,3,0,0,0],
[0,0,0,3,3,3,0,0,0],
[0,0,0,3,3,3,1,0,0],
[0,0,0,0,0,1,3,1,1],
[0,0,0,0,0,0,1,3,1],
[0,0,0,0,0,0,1,1,3]]
adj_mat = np.array(adj_mat)
sc = SpectralClustering(3, affinity='precomputed', n_init=100)
sc.fit(adj_mat)
print('spectral clustering')
print(sc.labels_)
Output
spectral clustering
[0 0 0 1 1 1 2 2 2]
Let's first cluster a graph G into K=2 clusters and then generalize for all K.
We can use the function linalg.algebraicconnectivity.fiedler_vector() from networkx to compute the Fiedler vector of the graph (the eigenvector corresponding to the second smallest eigenvalue of the graph Laplacian matrix), assuming the graph is connected and undirected.
Then we can threshold the values of the eigenvector to compute the cluster index each node corresponds to, as shown in the next code block:
import networkx as nx
import numpy as np
A = np.zeros((11,11))
A[0,1] = A[0,2] = A[0,3] = A[0,4] = 1
A[5,6] = A[5,7] = A[5,8] = A[5,9] = A[5,10] = 1
A[0,5] = 5
# nx.from_numpy_matrix was removed in networkx 3.0; from_numpy_array replaces it
G = nx.from_numpy_array(A)
ev = nx.linalg.algebraicconnectivity.fiedler_vector(G)
labels = [0 if v < 0 else 1 for v in ev] # using threshold 0
labels
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
nx.draw(G, pos=nx.drawing.layout.spring_layout(G),
with_labels=True, node_color=labels)
We can obtain the same clustering with eigen analysis of the graph Laplacian and then by choosing the eigenvector corresponding to the 2nd smallest eigenvalue too:
L = nx.laplacian_matrix(G)
e, v = np.linalg.eig(L.todense())
idx = np.argsort(e)
e = e[idx]
v = v[:,idx]
labels = [0 if x < 0 else 1 for x in v[:,1]] # using threshold 0
labels
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
drawing the graph again with the clusters labeled:
With SpectralClustering from sklearn.cluster we can get the exact same result:
from sklearn.cluster import SpectralClustering
sc = SpectralClustering(2, affinity='precomputed', n_init=100)
sc.fit(A)
sc.labels_
# [0 0 0 0 0 1 1 1 1 1 1]
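The eigen analysis used above can also be sketched with NumPy alone, which makes the construction of the Laplacian explicit. Assumptions in this sketch: the graph is the same 11-node example, symmetrized by hand to make it undirected, and the unnormalized Laplacian L = D - A is used. Since the sign of an eigenvector is arbitrary, the two cluster ids may come out swapped.

```python
import numpy as np

# Same 11-node example: two stars (hubs 0 and 5) joined by a heavy edge
A = np.zeros((11, 11))
A[0, 1:5] = 1
A[5, 6:11] = 1
A[0, 5] = 5
A = A + A.T                      # make the adjacency matrix symmetric

degrees = A.sum(axis=1)
L = np.diag(degrees) - A         # unnormalized graph Laplacian

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(L)
fiedler = eigenvectors[:, 1]     # eigenvector of the second smallest eigenvalue

labels = (fiedler >= 0).astype(int)   # threshold at 0
print(labels)
```

Using eigh instead of eig avoids the explicit argsort step, because the eigenvalues already come back sorted.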
We can generalize the above for K > 2 clusters as follows (use kmeans clustering for partitioning the Fiedler vector instead of thresholding):
The following code demonstrates how k-means clustering can be used to partition the Fiedler vector and obtain a 3-clustering of a graph defined by the following adjacency matrix:
A = np.array([[3,2,2,0,0,0,0,0,0],
[2,3,2,0,0,0,0,0,0],
[2,2,3,1,0,0,0,0,0],
[0,0,1,3,3,3,0,0,0],
[0,0,0,3,3,3,0,0,0],
[0,0,0,3,3,3,1,0,0],
[0,0,0,0,0,1,3,1,1],
[0,0,0,0,0,0,1,3,1],
[0,0,0,0,0,0,1,1,3]])
K = 3 # K clusters
# nx.from_numpy_matrix was removed in networkx 3.0; from_numpy_array replaces it
G = nx.from_numpy_array(A)
ev = nx.linalg.algebraicconnectivity.fiedler_vector(G)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=K, random_state=0).fit(ev.reshape(-1,1))
kmeans.labels_
# array([2, 2, 2, 0, 0, 0, 1, 1, 1])
Finally, the clustered graph can be drawn with nx.draw as before, passing kmeans.labels_ as node_color.

How to model data for tensorflow?

I have data of the form :
A B C D E F G
1 0 0 1 0 0 1
1 0 0 1 0 0 1
1 0 0 1 0 1 0
1 0 1 0 1 0 0
...
1 0 1 0 1 0 0
0 1 1 0 0 0 1
0 1 1 0 0 0 1
0 1 0 1 1 0 0
0 1 0 1 1 0 0
A,B,C,D are my inputs and E,F,G are my outputs. I wrote the following code in Python using TensorFlow:
from __future__ import print_function
#from random import randint
import numpy as np
import tflearn
import pandas as pd
data,labels =tflearn.data_utils.load_csv('dummy_data.csv',target_column=-1,categorical_labels=False, n_classes=None)
print(data)
# Build neural network
net = tflearn.input_data(shape=[None, 4])
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, 3, activation='softmax')
net = tflearn.regression(net)
# Define model
model = tflearn.DNN(net)
#Start training (apply gradient descent algorithm)
data_to_array = np.asarray(data)
print(data_to_array.shape)
#data_to_array= data_to_array.reshape(6,9)
print(data_to_array.shape)
model.fit(data_to_array, labels, n_epoch=10, batch_size=3, show_metric=True)
I am getting an error which says:
ValueError: Cannot feed value of shape (3, 6) for Tensor 'InputData/X:0', which has shape '(?, 4)'
I am guessing this is because my input data has 7 columns (0...6), but I want the input layer to take only the first four columns as input and predict the last 3 columns in the data as output. How can I model this?
If the data's in a numpy format, then the first 4 columns are taken with a simple slice:
data[:,0:4]
The : means "all rows", and 0:4 is a range of values 0,1,2,3, the first 4 columns.
If the data isn't in a numpy format, just convert it to a numpy format so you can slice easily.
Here's a related article on numpy slices: Numpy - slicing 2d row or column vector from array
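As a minimal sketch of the slicing described above, applied to a few rows of the table from the question: the first four columns (A to D) become the inputs and the last three (E to G) become the outputs. The names X and Y are illustrative.

```python
import numpy as np

# A few rows from the question's table, columns A..G
data = np.array([
    [1, 0, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 0, 1],
])

X = data[:, 0:4]   # all rows, columns 0..3 (A, B, C, D) as inputs
Y = data[:, 4:7]   # all rows, columns 4..6 (E, F, G) as outputs

print(X.shape, Y.shape)
```

X then matches an input layer of shape [None, 4], and Y matches an output layer with 3 units.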

How to represent boolean data in graph

How can I represent the data below in a comprehensible graph? I tried groupby() from pandas, but the result is not easy to read.
My objective is to show which of the combinations below causes the most accidents.
pieton bicyclette camion_lourd vehicule
0 0 1 1
0 1 0 1
1 1 0 0
0 1 1 0
0 1 0 1
1 0 0 1
0 0 0 1
0 0 0 1
1 1 0 0
0 1 0 1
y = df.groupby(['pieton', 'bicyclette', 'camion_lourd', 'vehicule']).size()
y.unstack()
result:
Here are some visualizations that may help you:
#data analysis and wrangling
import pandas as pd
import numpy as np
# visualization
import matplotlib.pyplot as plt
columns = ['pieton', 'bicyclette', 'camion_lourd', 'vehicule']
df = pd.DataFrame([[0,0,1,1],[0,1,0,1],
[1,1,0,0],[0,1,1,0],
[1,0,0,1],[0,0,0,1],
[0,0,0,1],[1,1,0,0],
[0,1,0,1]], columns = columns)
You can start by seeing the proportion of accident per category:
# Set up a grid of plots
fig = plt.figure(figsize=(10,10))
fig_dims = (3, 2)
# Plot accidents depending on type
plt.subplot2grid(fig_dims, (0, 0))
df['pieton'].value_counts().plot(kind='bar',
title='Pieton')
plt.subplot2grid(fig_dims, (0, 1))
df['bicyclette'].value_counts().plot(kind='bar',
title='bicyclette')
plt.subplot2grid(fig_dims, (1, 0))
df['camion_lourd'].value_counts().plot(kind='bar',
title='camion_lourd')
plt.subplot2grid(fig_dims, (1, 1))
df['vehicule'].value_counts().plot(kind='bar',
title='vehicule')
Which gives:
Or if you prefer:
# pd.Series.value_counts replaces the deprecated top-level pd.value_counts
df.apply(pd.Series.value_counts).plot(kind='bar',
title='all types')
But, more interestingly, I would do a comparison per pair. For example, for pedestrians:
pieton = {}
for col in columns:
    # number of accidents involving both a pedestrian and this category
    pieton[col] = np.sum(df.pieton[df[col] == 1])
pieton.pop('pieton', None)
plt.bar(range(len(pieton)), list(pieton.values()), align='center')
plt.xticks(range(len(pieton)), list(pieton.keys()))
plt.title("Who got an accident with a pedestrian?")
plt.legend(loc='best')
plt.show()
Which gives:
A similar plot can be made for bicycles, trucks and cars, giving:
It would be interesting to have more data points, to be able to draw better conclusions. However, this still tells us to watch out for bicycles if you are driving!
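The per-pair counts for all categories can also be obtained in one step as a co-occurrence matrix, which is a sketch of an alternative to looping over the columns: with 0/1 data, the product of the transposed frame with itself counts how often two columns are both 1. The data is the sample frame built above; the name cooc is illustrative.

```python
import pandas as pd

columns = ['pieton', 'bicyclette', 'camion_lourd', 'vehicule']
df = pd.DataFrame([[0,0,1,1],[0,1,0,1],
                   [1,1,0,0],[0,1,1,0],
                   [1,0,0,1],[0,0,0,1],
                   [0,0,0,1],[1,1,0,0],
                   [0,1,0,1]], columns=columns)

# Entry (a, b) = number of rows where both a and b are 1;
# the diagonal holds the total count per category
cooc = df.T.dot(df)
print(cooc)
```

Each row of cooc (without its diagonal entry) corresponds to one of the per-category bar plots described above, and cooc.plot(kind='bar') is one way to draw them together.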
Hope this helped!
