I have the following code to perform hierarchical clustering on data:
Z = linkage(data,method='weighted')
plt.subplot(2,1,1)
dendro = dendrogram(Z)
leaves = dendro['leaves']
print(leaves)
plt.show()
However, in the dendrogram all the clusters have the same color (blue). Is there a way to use different colors according to the similarity between clusters?
Looking at the documentation, it looks like you could pass the link_color_func keyword or the color_threshold keyword to get different colors.
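For instance, here is a minimal sketch of the link_color_func route (assuming the Z from the question); any function that maps a link id to a matplotlib color string should work here:
# Hedged sketch: color each link by its id, here simply cycling matplotlib's default color cycle C0..C9
dendrogram(Z, link_color_func=lambda link_id: f'C{link_id % 10}')
plt.show()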
Edit:
The default behavior of the dendrogram coloring scheme is, given a color_threshold = 0.7*max(Z[:,2]), to color all the descendent links below a cluster node k the same color if k is the first node below the cut threshold; otherwise, all links connecting nodes with distances greater than or equal to the threshold are colored blue [from the docs].
What the hell does this mean? Well, if you look at a dendrogram, different clusters are linked together. The "distance" between two clusters is the height of the link between them. The color_threshold is the height below which new clusters get different colors. If all your clusters are blue, then you need to raise your color_threshold. For example,
In [48]: mat = np.random.rand(10, 10)
In [49]: z = linkage(mat, method="weighted")
In [52]: d = dendrogram(z)
In [53]: d['color_list']
Out[53]: ['g', 'g', 'b', 'r', 'c', 'c', 'c', 'b', 'b']
In [54]: plt.show()
I can check what the default color_threshold is by
In [56]: 0.7*np.max(z[:,2])
Out[56]: 1.0278719020096947
If I lower the color_threshold, I get more blue because more links have distances greater than the new color_threshold. You can see this visually because all the links above 0.9 are now blue:
In [64]: d = dendrogram(z, color_threshold=.9)
In [65]: d['color_list']
Out[65]: ['g', 'b', 'b', 'r', 'b', 'b', 'b', 'b', 'b']
In [66]: plt.show()
If I increase the color_threshold to 1.2, the links below 1.2 will no longer be blue. Additionally, the cyan and red links will merge into a single color because their parent link is below 1.2:
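A minimal sketch of that case (same z as above; the exact output is not reproduced here):
# Raising the threshold: links that merge below 1.2 now get cluster colors,
# and the former cyan/red sub-clusters share one color because their parent link is below 1.2
d = dendrogram(z, color_threshold=1.2)
print(d['color_list'])
plt.show()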
The following code will produce a dendrogram with a different color for each leaf. If, in the process of merging clusters, it encounters two clusters with different colors, it falls back to the default dflt_col = "tab:blue".
Note: the link_matrix function is a plain copy of the one from the AgglomerativeClustering example in scikit-learn.
Explaining everything it does would be quite time-consuming, so just print any step that is unclear.
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform, pdist
from matplotlib.pyplot import cm
from sklearn.cluster import AgglomerativeClustering
import matplotlib.colors as clrs
def link_matrix(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram, as in the standard scikit-learn documentation
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count
    Z = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)
    return Z
def assign_link_colors(model):
    n_clusters = len(model.Z)
    scl_map_to_hex = mpl.cm.ScalarMappable(cmap="jet").to_rgba(np.unique(model.labels_), norm=True)  # colors.to_hex()
    col = [clrs.to_hex(rgb) for rgb in scl_map_to_hex]
    dic_labels = {s: [c, idx] for s, c, idx in zip(np.arange(len(model.feature_names_in_), dtype=int),
                                                   model.feature_names_in_, model.labels_)}
    model.dict_idx_name_cl = {k: v for k, v in sorted(dic_labels.items(), key=lambda item: item[1][1])}

    dflt_col = "tab:blue"  # unclustered blue
    model.dict_colors = {x: col[model.dict_idx_name_cl[x][1]] for x in model.dict_idx_name_cl}

    link_cols = {}
    for i, i_cl in enumerate(model.Z[:, :2].astype(int)):  # select only the first two columns (children indices)
        c1, c2 = (link_cols[x] if x > n_clusters else model.dict_colors[x] for x in i_cl)
        # Coloring choice: if both children share a color --> keep it; if neither is a leaf --> default ("undefined") color
        if c1 == c2:
            tmp_cl = c1
        elif min(i_cl) <= n_clusters:  # select the leaf color
            tmp_cl = model.dict_colors[min(i_cl)]
        else:
            tmp_cl = dflt_col
        link_cols[i + 1 + n_clusters] = tmp_cl
        # print(f'-link_cols: {link_cols}')
    return link_cols
def mod_2_dendrogram(model, **kwargs):
    plt.style.use('seaborn-whitegrid')
    plt.figure(figsize=(int(.5 * len(model.feature_names_in_)), 7))
    print(f'-0.7*max(Z[:,2]): {0.7 * max(model.Z[:, 2])}')

    # Plot the corresponding dendrogram
    ddata = dendrogram(model.Z,  # count_sort="descending",
                       **kwargs)

    # Plot cluster points & distance labels on the dendrogram
    y_lim = model.dist_thr
    for i, d, c in zip(ddata['icoord'], ddata['dcoord'], ddata['color_list']):
        x = sum(i[1:3]) / 2
        y = d[1]
        if y > y_lim:
            plt.plot(x, y, 'o', c=c, markeredgewidth=0)
            plt.annotate(np.round(y, 2), (x, y), xytext=(0, -5),
                         textcoords='offset points',
                         va='top', ha='center', fontsize=9)

    plt.axhline(y=model.dist_thr, color='orange', alpha=0.7, linestyle='--',
                label=f"threshold: {int(model.dist_thr)}")
    plt.title(f'Agglomerative Dendrogram with n_clust: {model.n_clusters_}')
    plt.xlabel('Clusters')
    plt.ylabel('Distance')
    plt.legend()
    return ddata
Now, the running example:
import string
import pandas as pd
np.random.seed(0)
dist = np.random.randint(1e4, size = (10,10))
np.fill_diagonal(dist, 0)
dist = pd.DataFrame(dist, columns = list(string.ascii_lowercase)[:dist.shape[0]])
dist_thr = 1.5e3
model = AgglomerativeClustering(distance_threshold = dist_thr, n_clusters=None, linkage = "single", metric = "precomputed",)
model.dist_thr = dist_thr
model = model.fit(dist)
model.Z = link_matrix(model)
link_cols = assign_link_colors(model)
_ = mod_2_dendrogram(model, labels = dist.columns,
link_color_func = lambda x: link_cols[x])
I have a community list, list_community, as shown below.
How do I edit the code below to make the communities visible?
import igraph as ig
from igraph import Graph

list_community = [['A', 'B', 'C', 'D'], ['E', 'F', 'G'], ['G', 'H', 'I', 'J']]
list_nodes = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
tuple_edges = [('A','B'),('A','C'),('A','D'),('B','C'),('B','D'),('C','D'),('C','E'),
               ('E','F'),('E','G'),('F','G'),('G','H'),
               ('G','I'),('G','J'),('H','I'),('H','J'),('I','J')]

# Make a graph
g_test = Graph()
g_test.add_vertices(list_nodes)
g_test.add_edges(tuple_edges)

# Plot
layout = g_test.layout("kk")
g_test.vs["name"] = list_nodes
visual_style = {}
visual_style["vertex_label"] = g_test.vs["name"]
visual_style["layout"] = layout
ig.plot(g_test, **visual_style)
I would like a plot that visualizes the communities as shown below.
Using a module other than igraph is also fine.
Thank you.
In igraph you can use the VertexCover to draw polygons around clusters (as also suggested by Szabolcs in his comment). You have to supply the option mark_groups when plotting the cover, possibly with some additional palette if you want. See some more detail in the documentation here.
In order to construct the VertexCover, you first have to make sure you get integer indices for each node in the graph you created. You can do that using g_test.vs.find.
clusters = [[g_test.vs.find(name=v).index for v in cl] for cl in list_community]
cover = ig.VertexCover(g_test, clusters)
After that, you can simply draw the cover like
ig.plot(cover,
        mark_groups=True,
        palette=ig.RainbowPalette(3))
resulting in the following picture
Here is a script that somewhat achieves what you're looking for. I had to handle the cases of single- and two-node communities separately, but for more than two nodes this draws a polygon within the nodes.
I had some trouble with matplotlib not accounting for overlapping edges and faces of polygons, which meant the choice was between (1) not having the polygon surround the nodes, or (2) having an extra outline just inside the edge of the polygon, caused by matplotlib overlapping the widened edge with the fill of the polygon. I left a comment in the code on how to switch from option (2) to option (1).
I also blatantly borrowed a convenience function from this post to correctly sort the nodes of each polygon so that matplotlib's plt.fill() fills it properly.
Option 1:
Option 2:
Full code:
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cm
def sort_xy(x, y):
    # Sort points by angle around their centroid so plt.fill() draws a proper polygon
    x0 = np.mean(x)
    y0 = np.mean(y)
    r = np.sqrt((x - x0)**2 + (y - y0)**2)
    angles = np.where((y - y0) > 0, np.arccos((x - x0) / r), 2 * np.pi - np.arccos((x - x0) / r))
    mask = np.argsort(angles)
    x_sorted = x[mask]
    y_sorted = y[mask]
    return x_sorted, y_sorted
G = nx.karate_club_graph()
pos = nx.spring_layout(G, seed=42)
fig, ax = plt.subplots(figsize=(8, 10))
nx.draw(G, pos=pos, with_labels=True)
communities = nx.community.louvain_communities(G)
alpha = 0.5
edge_padding = 10
colors = cm.get_cmap('viridis', len(communities))
for i, comm in enumerate(communities):
    if len(comm) == 1:
        # Single-node community: draw a circle around the node
        cir = plt.Circle((pos[comm.pop()]), edge_padding / 100, alpha=alpha, color=colors(i))
        ax.add_patch(cir)
    elif len(comm) == 2:
        # Two-node community: draw a thick line segment between the two nodes
        comm_pos = {k: pos[k] for k in comm}
        coords = [a for a in zip(*comm_pos.values())]
        x, y = coords[0], coords[1]
        plt.plot(x, y, linewidth=edge_padding, linestyle="-", alpha=alpha, color=colors(i))
    else:
        # Three or more nodes: fill a polygon through the (angle-sorted) node positions
        comm_pos = {k: pos[k] for k in comm}
        coords = [a for a in zip(*comm_pos.values())]
        x, y = sort_xy(np.array(coords[0]), np.array(coords[1]))
        plt.fill(x, y, alpha=alpha, facecolor=colors(i),
                 edgecolor=colors(i),  # set to None to remove edge padding
                 linewidth=edge_padding)
I created a heatmap based on Spearman's correlation matrix using seaborn clustermap as follows. I want to paint the dendrogram so that it looks like the dendrogram image below, but on the heatmap.
I created a dict of colors as follows and got an error:
def assign_tree_colour(name, val_dict, coding_names_df):
    ret = None
    if val_dict.get(name, '') == 'Group 1':
        ret = "(0,0.9,0.4)"   # green
    elif val_dict.get(name, '') == 'Group 2':
        ret = "(0.6,0.1,0)"   # red
    elif val_dict.get(name, '') == 'Group 3':
        ret = "(0.3,0.8,1)"   # light blue
    elif val_dict.get(name, '') == 'Group 4':
        ret = "(0.4,0.1,1)"   # purple
    elif val_dict.get(name, '') == 'Group 5':
        ret = "(1,0.9,0.1)"   # yellow
    elif val_dict.get(name, '') == 'Group 6':
        ret = "(0,0,0)"       # black
    else:
        ret = "(0,0,0)"       # black
    return ret

def fix_string(s):
    return s.replace('"', '')
external_data3 = [list(z) for z in coding_names_df.values]
external_data3 = {fix_string(z[0]): z[3] for z in external_data3}
tree_label = list(df.index)
tree_label = [fix_string(x) for x in tree_label]
tree_labels = { j : tree_label[j] for j in range(0, len(tree_label) ) }
tree_colour = [assign_tree_colour(label, external_data3, coding_names_df) for label in tree_labels]
tree_colors = { i : tree_colour[i] for i in range(0, len(tree_colour) ) }
sns.set(color_codes=True)
sns.set(font_scale=1)
g = sns.clustermap(df, cmap="bwr",
                   vmin=-1, vmax=1,
                   yticklabels=1, xticklabels=1,
                   cbar_kws={"ticks": [-1, -0.5, 0, 0.5, 1]},
                   figsize=(13, 13),
                   row_colors=row_colors,
                   col_colors=col_colors,
                   method='average',
                   metric='correlation',
                   tree_kws=dict(colors=tree_colors))
g.ax_heatmap.set_xlabel('Genus')
g.ax_heatmap.set_ylabel('Genus')
for label in Group.unique():
    g.ax_col_dendrogram.bar(0, 0, color=lut[label],
                            label=label, linewidth=0)
g.ax_col_dendrogram.legend(loc=9, ncol=7, bbox_to_anchor=(0.26, 0., 0.5, 1.5))
ax = g.ax_heatmap
File "<ipython-input-64-4bc6be89afe3>", line 11, in <module>
tree_kws=dict(colors=tree_colors))
File "C:\Users\rotemb\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\matrix.py", line 1391, in clustermap
tree_kws=tree_kws, **kwargs)
File "C:\Users\rotemb\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\matrix.py", line 1208, in plot
tree_kws=tree_kws)
File "C:\Users\rotemb\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\matrix.py", line 1054, in plot_dendrograms
tree_kws=tree_kws
File "C:\Users\rotemb\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\matrix.py", line 776, in dendrogram
return plotter.plot(ax=ax, tree_kws=tree_kws)
File "C:\Users\rotemb\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\matrix.py", line 692, in plot
**tree_kws)
File "C:\Users\rotemb\AppData\Local\Continuum\anaconda3\lib\site-packages\matplotlib\collections.py", line 1316, in __init__
colors = mcolors.to_rgba_array(colors)
File "C:\Users\rotemb\AppData\Local\Continuum\anaconda3\lib\site-packages\matplotlib\colors.py", line 294, in to_rgba_array
result[i] = to_rgba(cc, alpha)
File "C:\Users\rotemb\AppData\Local\Continuum\anaconda3\lib\site-packages\matplotlib\colors.py", line 177, in to_rgba
rgba = _to_rgba_no_colorcycle(c, alpha)
File "C:\Users\rotemb\AppData\Local\Continuum\anaconda3\lib\site-packages\matplotlib\colors.py", line 240, in _to_rgba_no_colorcycle
raise ValueError("Invalid RGBA argument: {!r}".format(orig_c))
ValueError: Invalid RGBA argument: 0
Any help on this would be greatly appreciated!
Thanks!
According to the sns.clustermap documentation, the dendrogram coloring can be set through tree_kws (which takes a dict) and its colors attribute, which expects a list of RGB tuples such as (0.5, 0.5, 1). It also seems that colors accepts nothing but RGB(A) tuples.
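As a minimal sketch of a fix (assuming the tree_colors dict of strings built in the question): parse the strings back into real tuples and pass them as a flat list rather than a dict. This should remove the ValueError, although, as explained in the edit below, the list is consumed in drawing order rather than per label:
from ast import literal_eval
# Hypothetical fix: convert string colors like "(0,0.9,0.4)" into actual RGB tuples
tree_color_list = [literal_eval(c) for c in tree_colors.values()]
g = sns.clustermap(df, method='average', metric='correlation',
                   tree_kws=dict(colors=tree_color_list))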
Did you notice that clustermap supports nested lists or data frames for hierarchical colorbars in between dendrograms and the correlation matrix? They could be useful if the dendrograms get too crowded.
I hope this helps!
Edit
The list of RGB tuples is the sequence of line colors for the LineCollection; the sequence is consumed as each line is drawn in both dendrograms. (The order appears to start from the rightmost branch of the column dendrogram.) In order to associate a certain label with a data point, you need to figure out the drawing order of the data points in the dendrograms.
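If it helps, here is a small inspection sketch (assuming g is the ClusterGrid returned by sns.clustermap, as in the examples below); reordered_ind gives the leaf order after clustering, i.e. the order in which the data points end up along each dendrogram:
# Assumed: g = sns.clustermap(...) has already been called
print(g.dendrogram_row.reordered_ind)  # row/leaf order after clustering (top to bottom)
print(g.dendrogram_col.reordered_ind)  # column/leaf order (left to right)
print(len(g.dendrogram_row.linkage))   # number of merges = number of lines per dendrogram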
Edit II
Here's a minimal example for coloring the tree based on sns.clustermap examples:
import matplotlib.pyplot as plt
import seaborn as sns; sns.set(color_codes=True)
import pandas as pd
iris = sns.load_dataset("iris")
species = iris.pop("species")
g = sns.clustermap(iris)
lut = dict(zip(species.unique(), "rbg"))
row_colors = species.map(lut)
# For demonstrating the hierarchical sidebar coloring
df_colors = pd.DataFrame(data={'r': row_colors[row_colors == 'r'], 'g': row_colors[row_colors == 'g'], 'b': row_colors[row_colors == 'b']})
# Simple class RGBA colormap
colmap = {'setosa': (1, 0, 0, 0.7), 'virginica': (0, 1, 0, 0.7), 'versicolor': (0, 0, 1, 0.7)}
g = sns.clustermap(iris, row_colors=df_colors, tree_kws={'colors':[colmap[s] for s in species]})
plt.savefig('clustermap.png')
As you can see, the order of the drawn lines of the tree starts from the upper right corner of the image and is thus not tied to the order of the data points visualized in the clustermap. On the other hand, the color bars (controlled by the {row,col}_colors attributes) could be used for that purpose.
Building on the answer above, here is an example coloring the three main branches differently by brute force (the first 49 lines in red, the next 35 in green, the following 63 in blue, and the remaining two in black):
import matplotlib.pyplot as plt
import seaborn as sns; sns.set(color_codes=True)
import pandas as pd
iris = sns.load_dataset("iris")
species = iris.pop("species")
g = sns.clustermap(iris)
lut = dict(zip(species.unique(), "rbg"))
row_colors = species.map(lut)
# For demonstrating the hierarchical sidebar coloring
df_colors = pd.DataFrame(data={'r': row_colors[row_colors == 'r'], 'g': row_colors[row_colors == 'g'], 'b': row_colors[row_colors == 'b']})
# Simple class RGBA colormap
colmap = {'setosa': (1, 0, 0, 0.7), 'virginica': (0, 1, 0, 0.7), 'versicolor': (0, 0, 1, 0.7)}
g = sns.clustermap(iris, row_colors=df_colors, tree_kws={'colors':[(1,0,0,1)]*49+[(0,1,0,1)]*35+[(0,0,1,1)]*63+[(0,0,0,1)]*2})
plt.savefig('clustermap.png')
For the general case, the number of lines to color can be derived from the dendrogram (see the scipy linkage format described here):
# The number of leaves is always the number of merges + 1
# (if we have 2 leaves we do 1 merge)
n_leaves = len(g.dendrogram_row.linkage)+1
# The last merge on the array is naturally the one that joins
# the last two broad clusters together
n0_ndx = len(g.dendrogram_row.linkage) - 1
# At index [n0_ndx] of the linkage array, positions [0] and [1],
# we have the "indexes" of the two clusters that were merged.
# However, in order to find the actual index of these two
# clusters in the linkage array, we must subtract from this
# position (cluster/element number) the total number of leaves,
# because the cluster number listed here starts at 0 with the
# individual elements given to the function; and these elements
# are not themselves part of the linkage array.
# So linkage[0] has cluster number equal to n_leaves; and conversely,
# to calculate the index of a cluster in the linkage array,
# we must subtract the value of n_leaves from the cluster number.
n1_ndx = int(g.dendrogram_row.linkage[n0_ndx][0])-n_leaves
n2_ndx = int(g.dendrogram_row.linkage[n0_ndx][1])-n_leaves
# Similarly we can find the array index of clusters further down
n21_ndx = int(g.dendrogram_row.linkage[n2_ndx][0])-n_leaves
n22_ndx = int(g.dendrogram_row.linkage[n2_ndx][1])-n_leaves
# And finally, having identified the array index of the clusters
# that we are interested in coloring, we can determine the number
# of members in each cluster, which is stored in position [3]
# of each element of the array
n1 = int(g.dendrogram_row.linkage[n1_ndx][3])-1
n21 = int(g.dendrogram_row.linkage[n21_ndx][3])-1
n22 = int(g.dendrogram_row.linkage[n22_ndx][3])-1
# So we can finally color, with RGBa tuples, an amount of elements
# equal to the number of elements in each cluster of interest.
g = sns.clustermap(iris, row_colors=df_colors,
                   tree_kws={'colors': [(1,0,0,1)]*n1 + [(0,1,0,1)]*n21 + [(0,0,1,1)]*n22
                             + [(0,0,0,1)]*(n_leaves-1-n1-n21-n22)})
Though, I have not figured out a way to color the top dendrogram differently...
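One possible, untested workaround sketch: since the same colors list is handed to both dendrogram axes, the column tree could in principle be recolored after plotting by modifying its LineCollection directly. The assumption that the first collection on ax_col_dendrogram is the tree is mine:
# Assumption: the column dendrogram's tree is the first collection on its axes
col_tree = g.ax_col_dendrogram.collections[0]
col_tree.set_color([(0.5, 0.5, 0.5, 1.0)] * len(g.dendrogram_col.linkage))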
Sometimes I get histograms that look like the one below:
I see the peaks loud and clear, but not much else; is there a way to drop the "bin outliers" from a histogram so that the rest of the distribution can be seen better?
This can be accomplished by simply setting ylim; however, that discards the peak information. To retain it, we can include it via annotations, as follows:
Fetching histogram heights, N, and positions, bins
Selecting a ymax; e.g. 2nd or 3rd max N
Packing (position, height) into a string, and annotating
All combined and an example below; I used your exact data for comparison, since you are me.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
data = np.random.randn(100, 100) ** 3
data[:50] = 0
hist_visible(data, peaks_to_clip=3, bins=500, annot_kw={})
data[:95] = 0
hist_visible(data, peaks_to_clip=3, bins=500, annot_kw={})
Function:
def hist_visible(data, peaks_to_clip=1, bins=200, annot_kw=None):
    def _annotate(peaks_info, annot_kw):
        def _process_annot_kw(annot_kw):
            defaults = dict(weight='bold', fontsize=13, color='r',
                            xy=(.85, .85), xycoords='axes fraction')
            if not annot_kw:
                annot_kw = defaults.copy()
            else:
                annot_kw = annot_kw.copy()  # ensure external dict unaffected
                # if `defaults` key not in `annot_kw`, add it & its value
                for k, v in defaults.items():
                    if k not in annot_kw:
                        annot_kw[k] = v
            return annot_kw

        def _make_annotation(peaks_info):
            txt = ''
            for entry in peaks_info:
                txt += "({:.2f}, {})\n".format(entry[0], int(entry[1]))
            return txt.rstrip('\n')

        annot_kw = _process_annot_kw(annot_kw)
        txt = _make_annotation(peaks_info)
        plt.annotate(txt, **annot_kw)

    N, bins, _ = plt.hist(np.asarray(data).ravel(), bins=bins)
    Ns = np.sort(N)
    lower_max = Ns[-(peaks_to_clip + 1)]

    peaks_info = []
    for peak_idx in range(1, peaks_to_clip + 1):
        patch_idx = np.where(N == Ns[-peak_idx])[0][0]
        peaks_info.append([bins[patch_idx], N[patch_idx]])

    plt.ylim(0, lower_max)
    if annot_kw is not None:
        _annotate(peaks_info, annot_kw)
    plt.show()
I want to cluster user data by user_id, because I need to analyze each cluster after clustering.
My clustering algorithm is k-means with k=3. I'm using Python.
My data:
V1,V2
100,10
150,20
200,10
120,15
300,10
400,10
300,10
400,10
I removed the user_id column from this data, since as far as I know the user_id should be removed for k-means clustering.
My Python code:
# -*- coding: utf-8 -*-
"""
Spyder Editor
This is a temporary script file.
"""
from copy import deepcopy
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')
# Importing the dataset
data = pd.read_csv('C:/Users/S.M_Emamian/Desktop/xclara.csv')
print("Input Data and Shape")
print(data.shape)
data.head()
# Getting the values and plotting it
f1 = data['V1'].values
f2 = data['V2'].values
X = np.array(list(zip(f1, f2)))
plt.scatter(f1, f2, c='black', s=7)
# Euclidean distance calculator
def dist(a, b, ax=1):
    return np.linalg.norm(a - b, axis=ax)
# Number of clusters
k = 3
# X coordinates of random centroids
C_x = np.random.randint(0, np.max(X)-20, size=k)
# Y coordinates of random centroids
C_y = np.random.randint(0, np.max(X)-20, size=k)
C = np.array(list(zip(C_x, C_y)), dtype=np.float32)
print("Initial Centroids")
print(C)
# Plotting along with the Centroids
plt.scatter(f1, f2, c='#050505', s=7)
plt.scatter(C_x, C_y, marker='*', s=200, c='g')
# To store the value of centroids when it updates
C_old = np.zeros(C.shape)
# Cluster Lables(0, 1, 2)
clusters = np.zeros(len(X))
# Error func. - Distance between new centroids and old centroids
error = dist(C, C_old, None)
# Loop will run till the error becomes zero
while error != 0:
    # Assigning each value to its closest cluster
    for i in range(len(X)):
        distances = dist(X[i], C)
        cluster = np.argmin(distances)
        clusters[i] = cluster
    # Storing the old centroid values
    C_old = deepcopy(C)
    # Finding the new centroids by taking the average value
    for i in range(k):
        points = [X[j] for j in range(len(X)) if clusters[j] == i]
        C[i] = np.mean(points, axis=0)
    error = dist(C, C_old, None)
colors = ['r', 'g', 'b', 'y', 'c', 'm']
fig, ax = plt.subplots()
for i in range(k):
    points = np.array([X[j] for j in range(len(X)) if clusters[j] == i])
    ax.scatter(points[:, 0], points[:, 1], s=7, c=colors[i])
ax.scatter(C[:, 0], C[:, 1], marker='*', s=200, c='#050505')
'''
==========================================================
scikit-learn
==========================================================
'''
from sklearn.cluster import KMeans
# Number of clusters
kmeans = KMeans(n_clusters=3)
# Fitting the input data
kmeans = kmeans.fit(X)
# Getting the cluster labels
labels = kmeans.predict(X)
# Centroid values
centroids = kmeans.cluster_centers_
# Comparing with scikit-learn centroids
print("Centroid values")
print("Scratch")
print(C) # From Scratch
print("sklearn")
print(centroids) # From sci-kit learn
My code works fine and visualizes my data as well, but I need to keep the user_id.
For example, I would like to know which cluster user_id=5 belongs to.
Just add user_id after clustering.
Actually, what you probably want to do is the opposite: add the cluster labels as a new column to your original data, which still has the user_id.
As long as you don't change the data order, this is a trivial stacking operation.
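A minimal sketch of that, assuming the original frame (here called df_with_ids, a hypothetical name) still contains user_id and that kmeans was fit on its rows in the same order:
# Hypothetical: df_with_ids is the original data frame that still has the user_id column
df_with_ids['cluster'] = kmeans.labels_   # row order unchanged, so labels line up
print(df_with_ids.loc[df_with_ids['user_id'] == 5, 'cluster'])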
When plotting a network in Holoviews, how can I set the position of the nodes based on an attribute? I have a network with timestamps for each node, and would like to position the nodes based on the associated time.
I figured out how to set the x position based on an attribute, but I would still like holoviews to find the optimal y position (see Holoviews graph visualization: get optimal y position, given x position).
The code below sets the x position based on a node attribute:
import holoviews as hv
import numpy as np
import pandas as pd
import networkx as nx
N = 5
num_edges = 2
list1 = np.arange(N).tolist()*num_edges
list2 = np.array(list1)
np.random.shuffle(list2)
edgelist = pd.DataFrame({'vertex1': list1, 'vertex2': list2, 'weight': np.random.uniform(0, 1, len(list1))})
edgelist = edgelist[edgelist.vertex1 != edgelist.vertex2]
edgelist = edgelist.drop_duplicates()
times = pd.DataFrame({'vertex': np.arange(N), 'time': np.random.normal(0, 10, N)})
x = times.time
y = np.random.uniform(0, 1, N)
padding = dict(x=(np.min(x) - 1, np.max(x) + 1), y=(-1.2, 1.2))
node_indices = np.arange(N)
pos_graph = hv.Graph(((edgelist.vertex1, edgelist.vertex2), (x, y, node_indices))).redim.range(**padding)
pos_graph