How to extract and map cluster indices from sklearn.cluster.KMeans? - python

I have a map of data:
import seaborn as sns
import matplotlib.pyplot as plt
X = 101_by_99_float32_array
ax = sns.heatmap(X, square = True)
plt.show()
Note these data are essentially a 3D surface, and I'm interested in the index positions in X after clustering. I can easily apply the kmeans algorithm to my data:
from sklearn.cluster import KMeans
# three clusters is arbitrary; just used for testing purposes
k_means = KMeans(init='k-means++', n_clusters=3, n_init=10).fit(X)
But I am not sure how to navigate kmeans in a way that will identify to which cluster a pixel in the map above belongs. What I want to do is make a map that looks like the one above, but instead of plotting the z-value for each cell in the 100x99 array X, I'd like to plot the cluster number for each cell in X.
I don't know if this is possible with the output of the kmeans algorithm, but I did try an approach from the scikitlearn documents here:
import numpy as np
k_means_labels = k_means.labels_
k_means_cluster_centers = k_means.cluster_centers_
k_means_labels_unique = np.unique(k_means_labels)
colors = ['#4EACC5', '#FF9C34', '#4E9A06']
plt.figure()
#plt.hold(True)
for k, col in zip(range(3), colors):
my_members = k_means_labels == k
cluster_center = k_means_cluster_centers[k]
plt.plot(X[my_members, 0], X[my_members, 1], 'w',
markerfacecolor=col, marker='.')
plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
markeredgecolor='k', markersize=6)
plt.title('KMeans')
plt.show()
But it's clear this is not accessing the information I want...
It's obvious I do not fully understanding what each component of the kmeans output represents, and I've tried to read the explanations in the answer to the question found here. However, there's nothing in that answer that explicitly addresses whether the indices of the original data were preserved after clustering, which is really the core of my question. If such information is implicitly present in kmeans through some matrix multiplication, I could really use some help extracting it.
Thank you for your time and assistance!
EDIT:
Thanks to #Nakor, for both the explanation about kmeans and the suggestion to reshape my data. How kmeans is interpreting my data is now much clearer. I should not expect it to capture the indices of each sample, but instead rely on reshape to do so. reshape will ravel the original (101,99) matrix into (9999,1) array which, as #Nakor pointed out, is suitable for clustering every entry as an individual sample.
Simply reapply reshape to kmeans.labels_ using the original shape of the data and I've gotten the result I'm looking for:
Y = X.reshape(-1, 1) # shape data to cluster each individual entry
kmeans= KMeans(init='k-means++', n_clusters=3, n_init=10)
kmeans.fit(Y)
Z = kmeans.labels_
A = Z.reshape(101,99)
plt.figure()
ax = sns.heatmap(cu_map, square = True)
plt.figure()
ay = sns.heatmap(A, square = True)

Your issue is that sklearn.cluster.KMeans expects a 2D matrix with [N_samples,N_features]. However, you provide the raw image, so sklearn understands you have 101 samples with 99 features each (each row of your image is a sample, and the columns are the features). As a results, what you get in k_means.labels_ is the cluster assignment of each of the rows.
In you want instead to cluster every single entry, you need to reshape your data like this for instance:
model = KMeans(init='k-means++', n_clusters=3, n_init=10)
model.fit(X.reshape(-1,1))
If I check with randomly generated data, I get:
In [1]: len(model.labels_)
Out[1]: 9999
I have one label per entry.

Related

Principle Component Analysis, add a line to the 3d graph showing the first principal component

I am conducting PCA on a dataset. I am attempting to add a line in my 3d graph which shows the first principal component. I have tried a few methods but have not been able to display the first principal component as a line in my 3d graph. Any help is greatly appreciated. My code is as follows:
import numpy as np
np.set_printoptions (suppress=True, precision=5, linewidth=150)
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
file_name = 'C:/Users/data'
input_data = pd.read_csv (file_name + '.csv', header=0, index_col=0)
A = input_data.A.values.astype(float)
B = input_data.B.values.astype(float)
C = input_data.C.values.astype(float)
D = input_data.D.values.astype(float)
E = input_data.E.values.astype(float)
F = input_data.F.values.astype(float)
X = np.column_stack((A, B, C, D, E, F))
ncompo = int (input ("Number of components to study: "))
print("")
pca = PCA (n_components = ncompo)
pcafit = pca.fit(X)
cov_mat = np.cov(X, rowvar=0)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
perc = pcafit.explained_variance_ratio_
perc_x = range(1, len(perc)+1)
plt.plot(perc_x, perc)
plt.xlabel('Components')
plt.ylabel('Percentage of Variance Explained')
plt.show()
#3d Graph
plt.clf()
le = LabelEncoder()
le.fit(input_data.Grade)
number = le.transform(input_data.Grade)
colormap = np.array(['green', 'blue', 'red', 'yellow'])
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(D, E, F, c=colormap[number])
ax.set_xlabel('D')
ax.set_ylabel('E')
ax.set_zlabel('F')
plt.title('PCA')
plt.show()
Some remarks to begin with:
You are computing PCA twice! To compute PCA is to compute eigen values and eigen vectors of the covariance matrix. So, either you use the sklearn function pca.fit, either you do it yourself. But you don't need to do both, unless you want to discover pca.fit and see for yourself that it does exactly what you expect it to do (if this is what you wanted, fine. It is a good thing to do that king of checking. I did this once also). Of course pca.fit has another advantage: once you have it, it also provides pca.predict to project points in the components space. But that also is simply a base change using eigenvectors matrix (that is matrix to change base)
pca object let you get the eigenvectors (pca.components_) and eigen values (pca.explained_variance_)
pca.fit is a 'inplace' method. It does not return a new PCA object. It just fit the one you have. So, no need to get pcafit and use it.
This is not a minimal reproducible exemple as required on SO. We should be able to copy and paste it, and run it, so see exactly your problem. Not to guess what kind of secret data you have. And in the meantime, it should be minimal. So, contains data example generation (it doesn't matter if those data doesn't make sense. Sometimes it is even better, since it allows some testing. In my following code, I generate my own noisy data along an axis, which allow me to verify that, indeed, I am able to "guess" what was that axis). Plus, since your problem concerns only 3d plot, there is no need to include ploting of explained variance here. That part is not part of your question.
Now, to print the principal component, well, you already did the hard part. Twice. That is to compute it. It is the eigenvector associated with the highest eigenvalue.
With pca object no need to search for it, they are already sorted. So it is simply pca.components_[0]. And since you want to plot in the space D,E,F, you simply need to draw vector pca.components_[0][3:].
With correct scaling.
You can do that with plot providing just 2 points (first and last)
Here is my version (which, by the way, shows also what a minimal reproducible example is)
import numpy as np
np.set_printoptions (suppress=True, precision=5, linewidth=150)
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
# Generation of random data along a given vector
vec=np.array([1, -1, 0.5, -0.5, 0.75, 0.75]).reshape(-1,1)
# 10000 random data, that are U[0,10]×vec + gaussian noise std=1
X=(vec*np.random.rand(10000)*10 + np.random.normal(0,1,(6,10000))).T
(A,B,C,D,E,F)=X.T
input_data = pd.DataFrame({'A':A,'B':B,'C':C,'D':D,'E':E, 'F':F, 'Grade':np.random.randint(1,5, (10000,))})
ncompo=6
pca = PCA (n_components = ncompo)
pca.fit(X)
# Redundant
cov_mat = np.cov(X, rowvar=0)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
# See
print("Eigen values")
print(eig_vals)
print(pca.explained_variance_)
print("Eigen vec")
print(eig_vecs)
print(pca.components_)
# Note, compare first components to
print("Main component")
print(vec/np.linalg.norm(vec))
print(pca.components_[0])
#3d Graph
le = LabelEncoder()
le.fit(input_data.Grade)
number = le.transform(input_data.Grade)
fig = plt.figure()
colormap = np.array(['green', 'blue', 'red', 'yellow'])
ax = fig.add_subplot(111, projection='3d')
ax.scatter(D, E, F, c=colormap[number])
U=pca.components_[0]
sc1=max(D)/U[3]
sc2=min(D)/U[3]
# Draw the 1st principal component as a blue line
ax.plot([sc1*U[3],sc2*U[3]], [sc1*U[4], sc2*U[4]], [sc1*U[5], sc2*U[5]], linewidth=3)
ax.set_xlabel('D')
ax.set_ylabel('E')
ax.set_zlabel('F')
plt.title('PCA')
plt.show()
My example is not that minimal, because I took advantage of it to illustrate my first remark, and also computed PCA twice, to compare both result.
So, here I print, eigenvalues
Eigen values
[30.88941 1.01334 0.99512 0.96493 0.97692 0.98101]
[30.88941 1.01334 0.99512 0.98101 0.97692 0.96493]
(1st being your computation by diagonalisation of covariance matrix, 2nd pca.explained_variance_)
As you can see, they are the same, except sorting for the 1st one
Like wise,
Eigen vec
[[-0.52251 -0.27292 0.40863 -0.06321 0.26699 0.6405 ]
[ 0.52521 0.07577 -0.34211 0.27583 -0.04161 0.72357]
[-0.26266 -0.41332 -0.60091 0.38027 0.47573 -0.16779]
[ 0.26354 -0.52548 0.47284 0.59159 -0.24029 -0.15204]
[-0.39493 0.63946 0.07496 0.64966 -0.08619 0.00252]
[-0.3959 -0.25276 -0.35452 -0.0572 -0.79718 0.12217]]
[[ 0.52251 -0.52521 0.26266 -0.26354 0.39493 0.3959 ]
[-0.27292 0.07577 -0.41332 -0.52548 0.63946 -0.25276]
[-0.40863 0.34211 0.60091 -0.47284 -0.07496 0.35452]
[-0.6405 -0.72357 0.16779 0.15204 -0.00252 -0.12217]
[-0.26699 0.04161 -0.47573 0.24029 0.08619 0.79718]
[-0.06321 0.27583 0.38027 0.59159 0.64966 -0.0572 ]]
Also the same, but for sorting and transpose.
Eigen vectors are presented column wise when you diagonalize a matrix.
Where as for pca.components_ each line is an eigen vector.
But you can see that in the 1st matrix, the eigen vector associated to the biggest eigen value, that is, since biggest eigen value was the 1st one, the 1st column (-0.52, 0.52, etc.)
is also the same as the first line of pca.components_.
Like wise, the 4th biggest eigen value in your diagonalisation was the last one.
And if you look at the last column of your eigen vectors (0.64, 0.72, -0.76...), it is the same as the 4th line of pca.components_ (with a irrelevant ×-1 factor)
So, long story short, you already have eigenvals in pca.explained_variance_ sorted from the biggest to the smallest. And eigen vectors in pca_components_, in the same order.
Last thing I print here, is comparison between the first component (pca.components_[0]) and the vector I used to generate the data in the first place (my data are all colinear to a vector vec, + a gaussian noise).
Main component
[[ 0.52523]
[-0.52523]
[ 0.26261]
[-0.26261]
[ 0.39392]
[ 0.39392]]
[ 0.52251 -0.52521 0.26266 -0.26354 0.39493 0.3959 ]
As expected, PCA did find correctly that main axis.
So, that was just side comments.
What is really what you were looking for is
ax.plot([sc1*U[3],sc2*U[3]], [sc1*U[4], sc2*U[4]], [sc1*U[5], sc2*U[5]], linewidth=3)
sc1 and sc2 being just scaling factors (here I choose it so that it scales approx like the data. Another way would have been to set ax.set_xlim, ax.set_ylim, ax.set_zlim from D.min(), D.max(), E.min(), E.max(), etc.
And then just use big values for sc1 and sc2, like
sc1=1000
sc2=-1000

Output K-Means to CSV with SciKit Learn - give cluster names

I have the below scikit learn script which outputs a nice chart (below) with each of the clusters.
I have a couple of questions:
- How can I export this to CSV - with a cluster name or ID?
- How can I name the clusters?
- How can I make sure the clusters are always named the same thing? For example, I want to call the top right segment 'high spenders' how do I so that where it will always be correct?
Thanks!
#import the required libraries
# - matplotlib is a charting library
# - Seaborn builds on top of Matplotlib and introduces additional plot types. It also makes your traditional Matplotlib plots look a bit prettier.
# - Numpy is numerical Python
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.datasets.samples_generator import make_blobs
from sklearn.cluster import KMeans
#Generate sample data, with distinct clusters for testing
#n_samples = the number of datapoints, equally split across each clusters
#centers = The number of centers to generate (number of clusters) - a center is the arithmetic mean of all the points belonging to the cluster.
#cluster_std = the standard deviation of the clusters - a quantity expressing by how much the members of a group differ from the mean value for the group (how tight is the cluster going to be)
#random_state = controls the random number generator being used. If you don't mention the random_state in the code, then whenever you execute your code a new random value is generated and the train and test datasets would have different values each time. However, if you use a particular value for random_state(random_state = 1 or any other value) everytime the result will be same,i.e, same values in train and test datasets.
#make_blobs generates "isotropic Gaussian blobs" - X is a numpy array with two columns which contain the (x, y) Gaussian coordinates of these points, whereas y contains the list of categories for each.
#X, y = simply means that the output of make_blobs() has two elements, that are assigned to X and y.
X, y = make_blobs(n_samples=300, centers=4,
cluster_std=0.50, random_state=0)
#X now looks like this - column zero becomes the X axis, column1 becomes the Y axis
array([[ 1.85219907, 1.10411295],
[-1.27582283, 7.76448722],
[ 1.0060939 , 4.43642592],
[-1.20998253, 7.83203579],
[ 1.92461484, 1.06347673],
[ 2.28565919, 0.79166208],
[-1.57379043, 2.69773813],
[ 1.04917913, 4.31668562],
[-1.07436851, 7.93489945],
[-1.15872975, 7.97295642]
#The below statement, will enable us to visualise matplotlib charts, even in ipython
#Using matplotlib backend: MacOSX
#Populating the interactive namespace from numpy and matplotlib
%pylab
#plot the chart
#s = the sizer of the points.
#X[:, 0] is the numpy coordinates way of selecting every row entry for column 0 - i.e. a single column from the numpy array.
#X[:, 1] is the numpy coordinates way of selecting every row entry for column 1 - i.e. a single column from the numpy array.
plt.scatter(X[:, 0], X[:, 1], s=50);
#now, I am definining that I want to find 4 clusters within the data. The general rule I follow is, I will have 7 times less clusters than datapoints.
kmeans = KMeans(n_clusters=4)
#build the model, based on X with the number of clusters defined above
kmeans.fit(X)
#now we're going to find clusters in the randomly generated dataset
predict = kmeans.predict(X)
#now we can plot the prediction
#c = colour, which is based on the predict variable we defined above
#s = the size of the plots
#X[:, 0] is the numpy coordinates way of selecting every row entry for column 0 - i.e. a single column from the numpy array.
#X[:, 1] is the numpy coordinates way of selecting every row entry for column 1 - i.e. a single column from the numpy array.
plt.scatter(X[:, 0], X[:, 1], c=predict, s=50)
Based on your code the following worked for me. You can certainly stay with numpy for storing the CSV but I simply prefer pandas. The sorting line should give you the same results everytime you run the code. However, since the initliazation of the clusters can have an impact I would also set a seed in your code, e.g. np.random.seed(42) and call the kmeans function with the random_state parameter, e.g. kmeans = KMeans(n_clusters=4, random_state=42)
# transform to dataframe
import pandas as pd
import seaborn as sns
df = pd.DataFrame(X)
df.columns = ["var1", "var2"]
df["cluster"] = predict
colors = sns.color_palette()[0:4]
df = df.sort_values("cluster")
# check plot
sns.scatterplot(df["var1"], df["var2"], hue=df["cluster"], palette=colors)
plt.show()
# define rename schema
mynames = {"0": "center_left", "1": "top_left", "2": "bot_right", "3": "center"}
df["cluster_name"] = [mynames[str(i)] for i in df.cluster]
# plot again to verify order
sns.scatterplot(df["var1"], df["var2"], hue=df["cluster_name"],
palette=colors)
sns.despine()
plt.show()
# save dataframe as CSV
df.to_csv("myoutput.csv")
The first plot looks like this:
The second plot looks like this:
The CSV will look like this:

How to overplot arrays of different shape?

I'm trying to overplot two arrays with different shapes but I'm unable to project one on the top of the other. For example:
#importing the relevant packages
import numpy as np
import matplotlib.pyplot as plt
def overplot(data1,data2):
'''
This function should make a contour plot
of data2 over the data1 plot.
'''
#creating the figure
fig = plt.figure()
#adding an axe
ax = fig.add_axes([1,1,1,1])
#making the plot for the
#first dataset
ax.imshow(data1)
#overplotting the contours
#for the second dataset
ax.contour(data2, projection = data2,
levels = [0.5,0.7])
#showing the figure
plt.show(fig)
return
if __name__ == '__main__':
'''
testing zone
'''
#creating two mock datasets
data1 = np.random.rand(3,3)
data2 = np.random.rand(9,9)
#using the overplot
overplot(data1,data2)
Currently, my output is something like:
While what I actually would like is to project the contours of the second dataset into the first one. This way, if I got images of the same object but with different resolution for the cameras I would be able to do such plots. How can I do that?
Thanks for your time and attention.
It's generally best to make the data match, and then plot it. This way you have complete control over how things are done.
In the simple example you give, you could use repeat along each axis to expand the 3x3 data to match the 9x9 data. That is, you could use, data1b = np.repeat(np.repeat(data1, 3, axis=1), 3, axis=0) to give:
But for the more interesting case of images, like you mention at the end of your question, then the axes probably won't be integer multiples and you'll be better served by a spline or other type interpolation. This difference is an example of why it's better to have control over this yourself, since there are many ways to to this type of mapping.

Linear Regression on Pandas DataFrame using Sklearn ( IndexError: tuple index out of range)

I'm new to Python and trying to perform linear regression using sklearn on a pandas dataframe. This is what I did:
data = pd.read_csv('xxxx.csv')
After that I got a DataFrame of two columns, let's call them 'c1', 'c2'. Now I want to do linear regression on the set of (c1,c2) so I entered
X=data['c1'].values
Y=data['c2'].values
linear_model.LinearRegression().fit(X,Y)
which resulted in the following error
IndexError: tuple index out of range
What's wrong here? Also, I'd like to know
visualize the result
make predictions based on the result?
I've searched and browsed a large number of sites but none of them seemed to instruct beginners on the proper syntax. Perhaps what's obvious to experts is not so obvious to a novice like myself.
Can you please help? Thank you very much for your time.
PS: I have noticed that a large number of beginner questions were down-voted in stackoverflow. Kindly take into account the fact that things that seem obvious to an expert user may take a beginner days to figure out. Please use discretion when pressing the down arrow lest you'd harm the vibrancy of this discussion community.
Let's assume your csv looks something like:
c1,c2
0.000000,0.968012
1.000000,2.712641
2.000000,11.958873
3.000000,10.889784
...
I generated the data as such:
import numpy as np
from sklearn import datasets, linear_model
import matplotlib.pyplot as plt
length = 10
x = np.arange(length, dtype=float).reshape((length, 1))
y = x + (np.random.rand(length)*10).reshape((length, 1))
This data is saved to test.csv (just so you know where it came from, obviously you'll use your own).
data = pd.read_csv('test.csv', index_col=False, header=0)
x = data.c1.values
y = data.c2.values
print x # prints: [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
You need to take a look at the shape of the data you are feeding into .fit().
Here x.shape = (10,) but we need it to be (10, 1), see sklearn. Same goes for y. So we reshape:
x = x.reshape(length, 1)
y = y.reshape(length, 1)
Now we create the regression object and then call fit():
regr = linear_model.LinearRegression()
regr.fit(x, y)
# plot it as in the example at http://scikit-learn.org/
plt.scatter(x, y, color='black')
plt.plot(x, regr.predict(x), color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
See sklearn linear regression example.
Dataset
Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
Importing the dataset
dataset = pd.read_csv('1.csv')
X = dataset[["mark1"]]
y = dataset[["mark2"]]
Fitting Simple Linear Regression to the set
regressor = LinearRegression()
regressor.fit(X, y)
Predicting the set results
y_pred = regressor.predict(X)
Visualising the set results
plt.scatter(X, y, color = 'red')
plt.plot(X, regressor.predict(X), color = 'blue')
plt.title('mark1 vs mark2')
plt.xlabel('mark1')
plt.ylabel('mark2')
plt.show()
I post an answer that addresses exactly the error that you got:
IndexError: tuple index out of range
Scikit-learn expects 2D inputs. Just reshape the X and Y.
Replace:
X=data['c1'].values # this has shape (XXX, ) - It's 1D
Y=data['c2'].values # this has shape (XXX, ) - It's 1D
linear_model.LinearRegression().fit(X,Y)
with
X=data['c1'].values.reshape(-1,1) # this has shape (XXX, 1) - it's 2D
Y=data['c2'].values.reshape(-1,1) # this has shape (XXX, 1) - it's 2D
linear_model.LinearRegression().fit(X,Y)
make predictions based on the result?
To predict,
lr = linear_model.LinearRegression().fit(X,Y)
lr.predict(X)
Is there any way I can view details of the regression?
The LinearRegression has coef_ and intercept_ attributes.
lr.coef_
lr.intercept_
show the slope and intercept.
You really should have a look at the docs for the fit method which you can view here
For how to visualize a linear regression, play with the example here. I'm guessing you haven't used ipython (Now called jupyter) much either, so you should definitely invest some time into learning that. It's a great tool for exploring data and machine learning. You can literally copy/paste the example from scikit linear regression into an ipython notebook and run it
For your specific problem with the fit method, by referring to the docs, you can see that the format of the data you are passing in for your X values is wrong.
Per the docs,
"X : numpy array or sparse matrix of shape [n_samples,n_features]"
You can fix your code with this
X = [[x] for x in data['c1'].values]

Sklearn.KMeans() : Get class centroid labels and reference to a dataset

Sci-Kit learn Kmeans and PCA dimensionality reduction
I have a dataset, 2M rows by 7 columns, with different measurements of home power consumption with a date for each measurement.
date,
Global_active_power,
Global_reactive_power,
Voltage,
Global_intensity,
Sub_metering_1,
Sub_metering_2,
Sub_metering_3
I put my dataset into a pandas dataframe, selecting all columns but the date column, then perform cross validation split.
import pandas as pd
from sklearn.cross_validation import train_test_split
data = pd.read_csv('household_power_consumption.txt', delimiter=';')
power_consumption = data.iloc[0:, 2:9].dropna()
pc_toarray = power_consumption.values
hpc_fit, hpc_fit1 = train_test_split(pc_toarray, train_size=.01)
power_consumption.head()
I use K-means classification followed by PCA dimensionality reduction to display.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
hpc = PCA(n_components=2).fit_transform(hpc_fit)
k_means = KMeans()
k_means.fit(hpc)
x_min, x_max = hpc[:, 0].min() - 5, hpc[:, 0].max() - 1
y_min, y_max = hpc[:, 1].min(), hpc[:, 1].max() + 5
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))
Z = k_means.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
extent=(xx.min(), xx.max(), yy.min(), yy.max()),
cmap=plt.cm.Paired,
aspect='auto', origin='lower')
plt.plot(hpc[:, 0], hpc[:, 1], 'k.', markersize=4)
centroids = k_means.cluster_centers_
inert = k_means.inertia_
plt.scatter(centroids[:, 0], centroids[:, 1],
marker='x', s=169, linewidths=3,
color='w', zorder=8)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()
Now I would like to find out which rows fell under a given class then which dates fell under a given class.
Is there any way to relate the points on the graph to an index in my
dataset, after PCA?
Some method I don't know of?
Or is my approach fundamentally flawed?
Any recommendations?
I am fairly new to this field and am trying to read through lots of code, this is a compilation of several examples I've seen documented .
My goal is to classify the data and then get the dates that fall under a class.
Thank You
KMeans().predict(X) ..docs here
Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
Parameters: (New data to predict)
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Returns: (Index of the cluster each sample belongs to)
labels : array, shape [n_samples,]
The problem I with the code you submitted is the use of
train_test_split()
which returns two arrays of random rows in your data-set, effectively ruining your dataset order making it difficult to correlate the labels returned from KMeans classification to sequential dates in your data set.
Here's an example:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
#read data into pandas dataframe
df = pd.read_csv('household_power_consumption.txt', delimiter=';')
#convert merge date and time colums and convert to datetime objects
df['Datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
df.set_index(pd.DatetimeIndex(df['Datetime'],inplace=True))
df.drop(['Date','Time'], axis=1, inplace=True)
#put last column first
cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]
df = df.dropna()
#convert dataframe to data array and removes date column not to be processed,
sliced = df.iloc[0:, 1:8].dropna()
hpc = sliced.values
k_means = KMeans()
k_means.fit(hpc)
# array of indexes corresponding to classes around centroids, in the order of your dataset
classified_data = k_means.labels_
#copy dataframe (may be memory intensive but just for illustration)
df_processed = df.copy()
df_processed['Cluster Class'] = pd.Series(classified_data, index=df_processed.index)
Now you can see your result matched with your data-set on the right side.
Now that it's classified, it's up to you to derive meaning.
This is just a good overall example of how it can be used, from start to finish.
Displaying your result, look at PCA or making other graphs dependent on class.

Categories