How to label clusters after applying k-means clustering to a dataset? - python

I have a dataset in .csv format which looks like this -
data
x,y,z, label
2,1,3, A
5,3,1, B
6,2,2, C
9,5,3, B
2,3,4, A
4,1,4, A
I would like to apply k-means clustering to the above dataset, which has three dimensions (x, y, z). After that, I would like to visualize the clusters in 3D, with each cluster labelled in the diagram.
I have already done this for a 2-dimensional dataset, as shown below:
kmeans_labels = cluster.KMeans(n_clusters=5).fit_predict(data)
And I plotted the 2-dimensional result with:
plt.scatter(standard_embedding[:, 0], standard_embedding[:, 1], c=kmeans_labels, s=0.1, cmap='Spectral');
Similarly, I would like to plot the 3-dimensional clustering with labels. Please let me know if you need more details.

Could something like that be a good solution?
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
data = np.array([[2,1,3], [5,3,1], [6,2,2], [9,5,3], [2,3,4], [4,1,4]])
cluster_count = 3
km = KMeans(cluster_count)
clusters = km.fit_predict(data)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(data[:, 0], data[:, 1], data[:, 2], c=clusters, alpha=1)
labels = ["A", "B", "C"]
for i, label in enumerate(labels):
    ax.text(km.cluster_centers_[i, 0], km.cluster_centers_[i, 1], km.cluster_centers_[i, 2], label)
ax.set_title("3D K-Means Clustering")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
plt.show()
EDIT
If you want a legend instead, just do this:
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
data = np.array([[2,1,3], [5,3,1], [6,2,2], [9,5,3], [2,3,4], [4,1,4]])
cluster_count = 3
km = KMeans(cluster_count)
clusters = km.fit_predict(data)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(data[:, 0], data[:, 1], data[:, 2], c=clusters, alpha=1)
handles = scatter.legend_elements()[0]
ax.legend(title="Clusters", handles=handles, labels = ["A", "B", "C"])
ax.set_title("3D K-Means Clustering")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
plt.show()
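Both snippets attach the names A, B, C to the cluster indices in a fixed order, but the index K-Means gives to a cluster is arbitrary. If the names should follow the label column of the CSV, one option is a majority vote per cluster. The following is only a minimal sketch: the helper name `majority_label_per_cluster` is made up for illustration, and `cluster_ids` is a hypothetical stand-in for the output of `km.fit_predict(data)`.

```python
from collections import Counter

def majority_label_per_cluster(cluster_ids, true_labels):
    """Map each cluster id to the most common original label among its members."""
    votes = {}
    for cid, lab in zip(cluster_ids, true_labels):
        votes.setdefault(cid, Counter())[lab] += 1
    return {cid: counts.most_common(1)[0][0] for cid, counts in votes.items()}

# Hypothetical cluster ids, standing in for km.fit_predict(data)
cluster_ids = [0, 1, 2, 1, 0, 0]
# The label column from the CSV in the question
true_labels = ["A", "B", "C", "B", "A", "A"]

cluster_names = majority_label_per_cluster(cluster_ids, true_labels)
# Names ordered by cluster id, usable as legend labels or ax.text labels
ordered_names = [cluster_names[cid] for cid in sorted(cluster_names)]
```

`ordered_names` could then be passed as `labels=` to `ax.legend` in place of the hard-coded list. Ties between labels inside one cluster are not handled specially here.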

Related

How to plot lines for individual rows in matplotlib?

Each row in the dataset has three datapoints. How can I plot a line for each one, as indicated?
import matplotlib.pyplot as plt
import pandas as pd
from mpl_toolkits.mplot3d import Axes3D
dataset = pd.read_csv('data.csv')
X = dataset.iloc[:, 0:2].values
y = dataset.iloc[:, -1].values
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], y, marker='.', color="red")
ax.set_xlabel("Cone")
ax.set_ylabel("Time")
ax.set_zlabel("Temp")
plt.show()
This is the data:
cone,ramp,temp
4,15,1141
4,60,1162
4,150,1183
5,15,1159
5,60,1186
5,150,1207
6,15,1185
6,60,1222
6,150,1243
7,15,1201
7,60,1239
7,150,1257
8,15,1211
8,60,1249
8,150,1271
9,15,1224
9,60,1260
9,150,1280
10,15,1251
10,60,1285
10,150,1305
11,15,1272
11,60,1294
11,150,1315
12,15,1285
12,60,1306
12,150,1326
13,15,1310
13,60,1331
13,150,1348
14,15,1351
14,60,1365
14,150,1384
One way is to loop over the unique values of the cone column:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')
for u in dataset["cone"].unique():
    extracted_df = dataset[dataset["cone"] == u]
    values = extracted_df.values
    ax.plot(values[:, 0], values[:, 1], values[:, 2], color="red")
ax.set_xlabel("Cone")
ax.set_ylabel("Time")
ax.set_zlabel("Temp")
plt.show()
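The same grouping can probably be written with `DataFrame.groupby`, which yields one sub-frame per cone value without filtering the frame repeatedly. This is a sketch under the assumption that the CSV has the `cone,ramp,temp` header shown in the question; only the first two cones are inlined here to keep it short.

```python
import io
import pandas as pd

# First rows of the data from the question, inlined so the sketch is self-contained
csv_text = """cone,ramp,temp
4,15,1141
4,60,1162
4,150,1183
5,15,1159
5,60,1186
5,150,1207
"""
dataset = pd.read_csv(io.StringIO(csv_text))

# One (n_points, 3) array per cone value; each array describes one line
lines = {
    cone: group[["cone", "ramp", "temp"]].to_numpy()
    for cone, group in dataset.groupby("cone")
}
```

Each array `pts` in `lines.values()` can then be drawn with `ax.plot(pts[:, 0], pts[:, 1], pts[:, 2], color="red")` on the 3D axes.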

PLS-DA Loading Plot in Python

How can I make a loading plot with Matplotlib for a PLS-DA model, similar to the loading plot of a PCA?
This answer explains how it can be done with PCA:
Plot PCA loadings and loading in biplot in sklearn (like R's autoplot)
However, there are some significant differences between the two methods, which make the implementation different as well. (Some of the relevant differences are explained here: https://learnche.org/pid/latent-variable-modelling/projection-to-latent-structures/interpreting-pls-scores-and-loadings )
To make the PLS-DA plot I use the following code:
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression
import numpy as np
import pandas as pd
targets = [0, 1]
x_vals = StandardScaler().fit_transform(df.values)
y = [g == targets[0] for g in sample_description]
y = np.array(y, dtype=int)
plsr = PLSRegression(n_components=2, scale=False)
plsr.fit(x_vals, y)
colormap = {
    targets[0]: '#ff0000',  # Red
    targets[1]: '#0000ff',  # Blue
}
colorlist = [colormap[c] for c in sample_description]
scores = pd.DataFrame(plsr.x_scores_)
scores.index = x.index
x_loadings = plsr.x_loadings_
y_loadings = plsr.y_loadings_
fig1, ax = get_default_fig_ax('Scores on LV 1', 'Scores on LV 2', title)
ax = scores.plot(x=0, y=1, kind='scatter', s=50, alpha=0.7,
                 c=colorlist, ax=ax)
I took your code and enhanced it. The biplot is obtained by simply overlaying the score plot and the loading plot.
Other, more rigorous plots could be made with truly shared axes according to https://blogs.sas.com/content/iml/2019/11/06/what-are-biplots.html#:~:text=A%20biplot%20is%20an%20overlay,them%20on%20a%20single%20plot.
The code below generates this image for a dataset with ~200 features (therefore ~200 red arrows are shown):
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cross_decomposition import PLSRegression
# X_train and Y_train are assumed to be prepared beforehand (Y_train as a DataFrame)
pls2 = PLSRegression(n_components=2)
pls2.fit(X_train, Y_train)
x_loadings = pls2.x_loadings_
y_loadings = pls2.y_loadings_
fig, ax = plt.subplots(constrained_layout=True)
# Score plot on the main axes
scores = pd.DataFrame(pls2.x_scores_)
scores.plot(x=0, y=1, kind='scatter', s=50, alpha=0.7,
            c=Y_train.values[:,0], ax = ax)
# Second, frameless axes on top of the first one for the loading arrows
newax = fig.add_axes(ax.get_position(), frameon=False)
feature_n=x_loadings.shape[0]
print(x_loadings.shape)
for feature_i in range(feature_n):
    comp_1_idx=0
    comp_2_idx=1
    newax.arrow(0, 0, x_loadings[feature_i,comp_1_idx], x_loadings[feature_i,comp_2_idx],color = 'r',alpha = 0.5)
newax.get_xaxis().set_visible(False)
newax.get_yaxis().set_visible(False)
plt.show()
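Instead of overlaying a second set of axes, the loadings can be rescaled to the range of the scores and drawn on the same axes. The sketch below is just a simple heuristic, not the method from the linked article: the function name `scale_loadings_to_scores` and the `fraction` parameter are made up for illustration, and random arrays stand in for `pls2.x_scores_` and `pls2.x_loadings_`.

```python
import numpy as np

def scale_loadings_to_scores(scores, loadings, fraction=0.8):
    """Rescale loadings so the largest loading component reaches
    `fraction` of the largest absolute score value."""
    score_extent = np.abs(scores).max()
    loading_extent = np.abs(loadings).max()
    return loadings * (fraction * score_extent / loading_extent)

# Random stand-ins for pls2.x_scores_ and pls2.x_loadings_
rng = np.random.default_rng(0)
scores = rng.normal(scale=5.0, size=(50, 2))
loadings = rng.normal(scale=0.1, size=(8, 2))

scaled = scale_loadings_to_scores(scores, loadings)
```

Because every loading is multiplied by the same factor, the directions and relative lengths of the arrows are unchanged; only the overall size is adapted. The rows of `scaled` could then be passed to `ax.arrow(0, 0, dx, dy)` on the score axes.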

how to generate a series of histograms on matplotlib?

I would like to generate a series of histograms like the one shown below:
The above visualization was done in tensorflow, but I'd like to reproduce the same visualization in matplotlib.
EDIT:
Using plt.fill_between, as suggested by @SpghttCd, I have the following code:
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import numpy as np
colors=cm.OrRd_r(np.linspace(.2, .6, 10))
plt.figure()
x = np.arange(100)
for i in range(10):
    y = np.random.rand(100)
    plt.fill_between(x, y + 10-i, 10-i,
                     facecolor=colors[i],
                     edgecolor='w')
plt.show()
This works great, but is it possible to use histogram instead of a continuous curve?
EDIT:
joypy-based approach, as mentioned in october's comment:
import pandas as pd
import joypy
import numpy as np
from matplotlib import cm
df = pd.DataFrame()
for i in range(0, 400, 20):
    df[i] = np.random.normal(i/410*5, size=30)
joypy.joyplot(df, overlap=2, colormap=cm.OrRd_r, linecolor='w', linewidth=.5)
For finer control of colors, you can define a color gradient function which accepts a fractional index and start and stop color tuples:
def color_gradient(x=0.0, start=(0, 0, 0), stop=(1, 1, 1)):
    r = np.interp(x, [0, 1], [start[0], stop[0]])
    g = np.interp(x, [0, 1], [start[1], stop[1]])
    b = np.interp(x, [0, 1], [start[2], stop[2]])
    return (r, g, b)
Usage:
joypy.joyplot(df, overlap=2, colormap=lambda x: color_gradient(x, start=(.78, .25, .09), stop=(1.0, .64, .44)), linecolor='w', linewidth=.5)
Examples with different start and stop tuples:
original answer:
You could iterate over the data arrays you'd like to plot with plt.fill_between, setting the fill colors to some gradient and the line color to white:
creating some sample data:
import numpy as np
t = np.linspace(-1.6, 1.6, 11)
y = np.cos(t)**2
y2 = lambda : y + np.random.random(len(y))/5-.1
plot the series:
import matplotlib.pyplot as plt
import matplotlib.cm as cm
colors = cm.OrRd_r(np.linspace(.2, .6, 10))
plt.figure()
for i in range(10):
    plt.fill_between(t+i, y2()+10-i/10, 10-i/10, facecolor = colors[i], edgecolor='w')
If you want it tailored more closely to your example, you should perhaps consider providing some sample data.
EDIT:
As I commented below, I'm not quite sure I understand what you want, or whether what you want is the best fit for your task. Therefore, here is some code which plots, besides the approach from your edit, two samples of how to present a bunch of histograms so that they are easier to compare:
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.cm as cm
N = 10
np.random.seed(42)
colors=cm.OrRd_r(np.linspace(.2, .6, N))
fig1 = plt.figure()
x = np.arange(100)
for i in range(10):
    y = np.random.rand(100)
    plt.fill_between(x, y + 10-i, 10-i,
                     facecolor=colors[i],
                     edgecolor='w')
data = np.random.binomial(20, .3, (N, 100))
fig2, axs = plt.subplots(N, figsize=(10, 6))
for i, d in enumerate(data):
    axs[i].hist(d, range(20), color=colors[i], label=str(i))
fig2.legend(loc='upper center', ncol=5)
fig3, ax = plt.subplots(figsize=(10, 6))
ax.hist(data.T, range(20), color=colors, label=[str(i) for i in range(N)])
fig3.legend(loc='upper center', ncol=5)
This leads to the following plots:
your plot from your edit:
N histograms in N subplots:
N histograms side by side in one plot:
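If the stacked look of the fill_between version is wanted with real histogram bars, one possible approach is to compute each histogram with `np.histogram` and draw it with `ax.bar`, using `bottom=` to lift each row to its own baseline. This is a rough sketch with assumed sample data (binomial draws, as in the code above); normalising each row to a height of at most 1 is an assumption made so that the rows do not overlap.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n_series = 10
bins = np.arange(21)  # shared bin edges 0..20, i.e. 20 bins
data = rng.binomial(20, 0.3, size=(n_series, 100))
colors = plt.get_cmap("OrRd_r")(np.linspace(0.2, 0.6, n_series))

fig, ax = plt.subplots()
baselines = []
for i, sample in enumerate(data):
    counts, edges = np.histogram(sample, bins=bins)
    heights = counts / counts.max()   # each row is at most 1 unit tall
    baseline = n_series - i           # rows are stacked from top to bottom
    baselines.append(baseline)
    ax.bar(edges[:-1], heights, width=np.diff(edges), bottom=baseline,
           align="edge", color=colors[i], edgecolor="w")

# Label each row with its series index instead of the raw baseline value
ax.set_yticks(baselines)
ax.set_yticklabels([str(i) for i in range(n_series)])
```

Calling `plt.show()` afterwards displays the figure. Dividing by a number larger than `counts.max()` would leave a gap between rows, while a smaller divisor would let them overlap as in a ridge plot.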

Scatterplot in matplotlib with legend and randomized point order

I'm trying to build a scatterplot of a large amount of data from multiple classes in python/matplotlib. Unfortunately, it appears that I have to choose between having my data randomised and having legend labels. Is there a way I can have both (preferably without manually coding the labels?)
Minimum reproducible example:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
X = np.random.normal(0, 1, [5000, 2])
Y = np.random.normal(0.5, 1, [5000, 2])
data = np.concatenate([X,Y])
classes = np.concatenate([np.repeat('X', X.shape[0]),
np.repeat('Y', Y.shape[0])])
Plotting with randomized points:
plot_idx = np.random.permutation(data.shape[0])
colors, _ = pd.factorize(classes)  # factorize returns (codes, uniques)
fig, ax = plt.subplots()
ax.scatter(data[plot_idx, 0],
data[plot_idx, 1],
c=colors[plot_idx],
label=classes[plot_idx],
alpha=0.4)
plt.legend()
plt.show()
This gives me the wrong legend.
Plotting with the correct legend:
from matplotlib import cm
unique_classes = np.unique(classes)
colors = cm.Set1(np.linspace(0, 1, len(unique_classes)))
fig, ax = plt.subplots()
for i, cls in enumerate(unique_classes):  # `class` is a reserved word in Python
    ax.scatter(data[classes == cls, 0],
               data[classes == cls, 1],
               c=colors[i],
               label=cls,
               alpha=0.4)
plt.legend()
plt.show()
But now the points are not randomized and the resulting plot is not representative of the data.
I'm looking for something that would give me a result like I get as follows in R:
library(ggplot2)
X <- matrix(rnorm(10000, 0, 1), ncol=2)
Y <- matrix(rnorm(10000, 0.5, 1), ncol=2)
data <- as.data.frame(rbind(X, Y))
data$classes <- rep(c('X', 'Y'), each=nrow(X))
plot_idx <- sample(nrow(data))
ggplot(data[plot_idx,], aes(x=V1, y=V2, color=classes)) +
geom_point(alpha=0.4, size=3)
You need to create the legend manually. This is not a big problem though. You can loop over the labels and create a legend entry for each. Here one may use a Line2D with a marker similar to the scatter as handle.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
X = np.random.normal(0, 1, [5000, 2])
Y = np.random.normal(0.5, 1, [5000, 2])
data = np.concatenate([X,Y])
classes = np.concatenate([np.repeat('X', X.shape[0]),
np.repeat('Y', Y.shape[0])])
plot_idx = np.random.permutation(data.shape[0])
colors,labels = pd.factorize(classes)
fig, ax = plt.subplots()
sc = ax.scatter(data[plot_idx, 0],
data[plot_idx, 1],
c=colors[plot_idx],
alpha=0.4)
h = lambda c: plt.Line2D([],[],color=c, ls="",marker="o")
plt.legend(handles=[h(sc.cmap(sc.norm(i))) for i in range(len(labels))],
labels=list(labels))
plt.show()
Alternatively, you can use a special scatter handler, as shown in the question Why doesn't the color of the points in a scatter plot match the color of the points in the corresponding legend?, but that seems a bit overkill here.
It's a bit of a hack, but you can save the axis limits, create the legend entries by drawing points well outside the limits of the plot, and then reset the axis limits as follows:
from matplotlib import cm
plot_idx = np.random.permutation(data.shape[0])
color_idx, unique_classes = pd.factorize(classes)
colors = cm.Set1(np.linspace(0, 1, len(unique_classes)))
fig, ax = plt.subplots()
ax.scatter(data[plot_idx, 0],
           data[plot_idx, 1],
           c=colors[color_idx[plot_idx]],
           alpha=0.4)
xlim = ax.get_xlim()
ylim = ax.get_ylim()
# Draw one dummy point per class far outside the view, only to get a legend entry
for i in range(len(unique_classes)):
    ax.scatter(xlim[1]*10,
               ylim[1]*10,
               c=colors[i],
               label=unique_classes[i])
ax.set_xlim(xlim)
ax.set_ylim(ylim)
plt.legend()
plt.show()
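With Matplotlib 3.1 or later, the legend handles can also be taken from the scatter itself via `PathCollection.legend_elements()`, which returns one handle per distinct value passed to `c=`. The sketch below assumes integer class codes obtained from `np.unique(..., return_inverse=True)` (used here instead of `pd.factorize`, so the label order is alphabetical) and a smaller sample than in the question.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (500, 2))
Y = rng.normal(0.5, 1, (500, 2))
data = np.concatenate([X, Y])
classes = np.concatenate([np.repeat("X", len(X)), np.repeat("Y", len(Y))])

# labels: sorted unique class names; codes: integer index of each point's class
labels, codes = np.unique(classes, return_inverse=True)
plot_idx = rng.permutation(len(data))  # randomized drawing order

fig, ax = plt.subplots()
sc = ax.scatter(data[plot_idx, 0], data[plot_idx, 1],
                c=codes[plot_idx], alpha=0.4)

# One handle per distinct code, in ascending code order, matching `labels`
handles, _ = sc.legend_elements()
legend = ax.legend(handles, [str(name) for name in labels], title="classes")
```

The points are still drawn in a single `scatter` call, so the random order is preserved, and the legend colors come from the same colormap and normalization as the points.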

matplotlib pyplot colorbar question

Dear all, I'm trying to make a scatter plot with colored points and an associated colorbar. I would like the colorbar to have string labels rather than numerical values, as I'm comparing two different data sets, each with different color values (but in any case between a minimum and a maximum value). Here is the code I'm using:
import matplotlib.pyplot as plt
import numpy as np
from numpy import *
from matplotlib import rc
import pylab
from pylab import *
from matplotlib import mpl
data = np.loadtxt('deltaBinned.txt')
data2 = np.loadtxt('deltaHalphaBinned.txt')
fig=plt.figure()
fig.subplots_adjust(bottom=0.1)
ax=fig.add_subplot(111)
plt.xlabel(r'$\partial \Delta/\partial\Phi[$mm$/^{\circ}]$',fontsize=16)
plt.ylabel(r'$\Delta$ [mm]',fontsize=16)
plt.scatter(data[:,0],data[:,1],marker='o',c=data[:,3],s=data[:,3]*1500,cmap=cm.Spectral,vmin=min(data[:,3]),vmax=max(data[:,3]))
plt.scatter(data2[:,0],data2[:,1],marker='^',c=data2[:,2],s=data2[:,2]*500,cmap=cm.Spectral,vmin=min(data2[:,2]),vmax=max(data2[:,2]))
cbar=plt.colorbar(ticks=[min(data2[:,2]),max(data2[:,2])])
cbar.set_ticks(['Low','High'])
cbar.set_label(r'PdF')
plt.show()
Unfortunately it does not work, as cbar.set_ticks does not accept string values. I've read the link
http://matplotlib.sourceforge.net/examples/pylab_examples/colorbar_tick_labelling_demo.html but I was not able to adapt it to my case. I apologize if the question is simple, but I'm just at the beginning of Python programming.
Nicola.
cbar.ax.set_yticklabels(['Low','High'])
For example,
import numpy as np
import matplotlib.cm as cm
import matplotlib.pyplot as plt
data = np.random.random((10, 4))
data2 = np.random.random((10, 4))
plt.subplots_adjust(bottom = 0.1)
plt.xlabel(r'$\partial \Delta/\partial\Phi[$mm$/^{\circ}]$', fontsize = 16)
plt.ylabel(r'$\Delta$ [mm]', fontsize = 16)
plt.scatter(
data[:, 0], data[:, 1], marker = 'o', c = data[:, 3], s = data[:, 3]*1500,
cmap = cm.Spectral, vmin = min(data[:, 3]), vmax = max(data[:, 3]))
plt.scatter(
data2[:, 0], data2[:, 1], marker = '^', c = data2[:, 2], s = data2[:, 2]*500,
cmap = cm.Spectral, vmin = min(data2[:, 2]), vmax = max(data2[:, 2]))
cbar = plt.colorbar(ticks = [min(data2[:, 2]), max(data2[:, 2])])
cbar.ax.set_yticklabels(['Low', 'High'])
cbar.set_label(r'PdF')
plt.show()
produces
