After doing PCA on my data and plotting the kmeans clusters, my plot looks really weird. The centers of the clusters and scatter plot of the points do not make sense to me. Here is my code:
#clicks, conversion, bounce and search are lists of values.
clicks=[2,0,0,8,7,...]
conversion = [1,0,0,6,0...]
bounce = [2,4,5,0,1....]
X = np.array([clicks,conversion, bounce]).T
y = np.array(search)
num_clusters = 5
pca=PCA(n_components=2, whiten=True)
data2D = pca.fit_transform(X)
print data2D
>>> [[-0.07187948 -0.17784291]
[-0.07173769 -0.26868727]
[-0.07173789 -0.26867958]
...,
[-0.06942414 -0.25040886]
[-0.06950897 -0.19591147]
[-0.07172973 -0.2687937 ]]
km = KMeans(n_clusters=num_clusters, init='k-means++',n_init=10, verbose=1)
km.fit_transform(X)
labels=km.labels_
centers2D = pca.fit_transform(km.cluster_centers_)
colors=['#000000','#FFFFFF','#FF0000','#00FF00','#0000FF']
col_map=dict(zip(set(labels),colors))
label_color = [col_map[l] for l in labels]
plt.scatter( data2D[:,0], data2D[:,1], c=label_color)
plt.hold(True)
plt.scatter(centers2D[:,0], centers2D[:,1], marker='x', c='r')
plt.show()
The red crosses are the centers of the clusters. Any help would be great.
Your ordering of PCA and KMeans is screwing things up...
Here is what you need to do:
Normalize your data.
Perform PCA on X to reduce the dimensions from 5 to 2 and produce Data2D
Normalize again
Cluster Data2D with KMeans
Plot the Centroids on top of Data2D.
Whereas here is what you have done above:
Perform PCA on X to reduce the dimensions from 5 to 2 to produce Data2D
Cluster the original data, X, in 5 dimensions.
Perform a separate PCA on your cluster centroids, which produces a completely different 2D subspace for the centroids.
Plot the PCA reduced Data2D with the PCA reduced centroids on top even though these no longer are coupled properly.
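The mismatch in the last two steps comes from calling fit_transform a second time, which fits a new PCA on the centroids alone. As a minimal sketch (this is not the approach taken in the new code further down, which clusters the reduced data instead), the centroids can be placed in the same 2D subspace as the data by reusing the already fitted PCA through transform. The synthetic X here is an assumption standing in for the question's data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in for the question's data (assumption): 300 rows, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3)) * [5.0, 2.0, 1.0]

# Fit the PCA once, on the data itself.
pca = PCA(n_components=2)
data2D = pca.fit_transform(X)

# Cluster in the original feature space, as in the question.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# Reuse the fitted PCA: transform (not fit_transform) keeps the same subspace.
centers2D = pca.transform(km.cluster_centers_)
```

With this change, centers2D and data2D share the same axes, so the crosses can be drawn on top of the scatter plot as before.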
Normalization:
Take a look at the code below and you'll see that it puts the centroids right where they need to be. The normalization is key and is completely reversible. ALWAYS normalize your data when you cluster as the distance metrics need to move through all of the spaces equally. Clustering is one of the most important times to normalize your data, but in general... ALWAYS NORMALIZE :-)
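To illustrate the claim that the normalization is reversible, here is a small sketch using the same mean/range formula as the code below. The tiny DataFrame is made up for the example (an assumption), and the names value_range and df_restored are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data with very different column scales (assumption).
df = pd.DataFrame({
    "freq": [0.0, 1.0, 0.0, 1.0],
    "visit_length": [1.0, 6.0, 21.0, 2500.0],
    "conversion_cnt": [0.0, 0.0, 1.0, 0.0],
})

# Keep the statistics so the transformation can be undone later.
mean = df.mean()
value_range = df.max() - df.min()

# Mean-centre each column and divide by its range.
df_norm = (df - mean) / value_range

# Invert the normalization to recover the original values.
df_restored = df_norm * value_range + mean
```

After normalizing, every column spans a range of exactly 1, so no single feature dominates the Euclidean distances that KMeans relies on.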
A heuristic discussion that goes beyond your original question:
The entire point of dimensionality reduction is to make the KMeans clustering easier and to project out dimensions which don't add to the variance of the data. So you should pass the reduced data to your clustering algorithm. I'll add that there are very few 5D datasets which can be projected down to 2D without throwing out a lot of variance, so look at the PCA diagnostics (explained_variance_ratio_, as printed in the code below) to see whether 90% of the original variance has been preserved. If not, then you might not want to be so aggressive in your PCA.
New Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import seaborn as sns
%matplotlib inline
# read your data, replace 'stackoverflow.csv' with your file path
df = pd.read_csv('stackoverflow.csv', usecols=[0, 2, 4], names=['freq', 'visit_length', 'conversion_cnt'], header=0).dropna()
df.describe()
#Normalize the data
df_norm = (df - df.mean()) / (df.max() - df.min())
num_clusters = 5
pca=PCA(n_components=2)
UnNormdata2D = pca.fit_transform(df_norm)
# Check the resulting variance
var = pca.explained_variance_ratio_
print "Varience after PCA: ",var
#Normalize again following PCA: data2D
data2D = (UnNormdata2D - UnNormdata2D.mean()) / (UnNormdata2D.max()-UnNormdata2D.min())
print "Data2D: "
print data2D
km = KMeans(n_clusters=num_clusters, init='k-means++',n_init=10, verbose=1)
km.fit_transform(data2D)
labels=km.labels_
centers2D = km.cluster_centers_
colors=['#000000','#FFFFFF','#FF0000','#00FF00','#0000FF']
col_map=dict(zip(set(labels),colors))
label_color = [col_map[l] for l in labels]
plt.scatter( data2D[:,0], data2D[:,1], c=label_color)
plt.hold(True)
plt.scatter(centers2D[:,0], centers2D[:,1],marker='x',s=150.0,color='purple')
plt.show()
Plot:
Output:
Varience after PCA: [ 0.65725709 0.29875307]
Data2D:
[[-0.00338421 -0.0009403 ]
[-0.00512081 -0.00095038]
[-0.00512081 -0.00095038]
...,
[-0.00477349 -0.00094836]
[-0.00373153 -0.00094232]
[-0.00512081 -0.00095038]]
Initialization complete
Iteration 0, inertia 51.225
Iteration 1, inertia 38.597
Iteration 2, inertia 36.837
...
...
Converged at iteration 31
Hope this helps!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
# read your data, replace 'stackoverflow.csv' with your file path
df = pd.read_csv('stackoverflow.csv', usecols=[0, 2, 4], names=['freq', 'visit_length', 'conversion_cnt'], header=0).dropna()
df.describe()
Out[3]:
freq visit_length conversion_cnt
count 289705.0000 289705.0000 289705.0000
mean 0.2624 20.7598 0.0748
std 0.4399 55.0571 0.2631
min 0.0000 1.0000 0.0000
25% 0.0000 6.0000 0.0000
50% 0.0000 10.0000 0.0000
75% 1.0000 21.0000 0.0000
max 1.0000 2500.0000 1.0000
# binarize freq and conversion_cnt
df.freq = np.where(df.freq > 1.0, 1, 0)
df.conversion_cnt = np.where(df.conversion_cnt > 0.0, 1, 0)
feature_names = df.columns
X_raw = df.values
transformer = PCA(n_components=2)
X_2d = transformer.fit_transform(X_raw)
# over 99.9% variance captured by 2d data
transformer.explained_variance_ratio_
Out[4]: array([ 9.9991e-01, 6.6411e-05])
# do clustering
estimator = KMeans(n_clusters=5, init='k-means++', n_init=10, verbose=1)
estimator.fit(X_2d)
labels = estimator.labels_
colors = ['#000000','#FFFFFF','#FF0000','#00FF00','#0000FF']
col_map=dict(zip(set(labels),colors))
label_color = [col_map[l] for l in labels]
fig, ax = plt.subplots()
ax.scatter(X_2d[:,0], X_2d[:,1], c=label_color)
ax.scatter(estimator.cluster_centers_[:,0], estimator.cluster_centers_[:,1], marker='x', s=50, c='r')
KMeans tries to minimize within-group Euclidean distance, and this may or may not be appropriate for your data. Just based on the graph, I would consider a Gaussian Mixture Model to do the unsupervised clustering.
Also, if you have prior knowledge about which observations belong to which category/label, you can use semi-supervised learning.
I'm performing PCA preprocessing on a dataset of 78 variables. How would I calculate the optimal value of PCA variables?
My first thought was to start at, for example, 5 components and work my way up, calculating accuracy each time. However, for obvious reasons, this was not a time-effective approach.
Does anyone have any suggestions/experience? Or even a methodology for calculating the optimal value?
First look at the dataset distribution and then use explained_variance_ to find the number of components.
Start with projecting your samples on a 2-D graph.
Assume I have a face dataset (Olivetti faces) with 40 people and 10 samples per person, 400 images overall. We will split it into 280 training and 120 test samples.
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split
olivetti = fetch_olivetti_faces()
x = olivetti.images # Train
y = olivetti.target # Labels
x_train, x_test, y_train, y_test = train_test_split(x, y,
test_size=0.3,
random_state=42)
x_train = x_train.reshape((x_train.shape[0], x.shape[1] * x.shape[2]))
x_test = x_test.reshape((x_test.shape[0], x.shape[1] * x.shape[2]))
x = x.reshape((x.shape[0]), x.shape[1] * x.shape[2])
Now we want to see how pixels are distributed. To understand clearly, we will display the pixels in a 2-D graph.
from sklearn.decomposition import PCA
from matplotlib.pyplot import figure, get_cmap, colorbar, show
class_num = 40
sample_num = 10
pca = PCA(n_components=2).fit_transform(x)
idx_range = class_num * sample_num
fig = figure(figsize=(6, 3), dpi=300)
ax = fig.add_subplot(1, 1, 1)
c_map = get_cmap(name='jet', lut=class_num)
scatter = ax.scatter(pca[:idx_range, 0], pca[:idx_range, 1],
c=y[:idx_range],s=10, cmap=c_map)
ax.set_xlabel("First Principal Component")
ax.set_ylabel("Second Principal Component")
ax.set_title("PCA projection of {} people".format(class_num))
colorbar(mappable=scatter)
show()
We can say 40 people, each with 10 samples are not distinguishable with only 2 principal components.
Please remember we created this graph from the main dataset, neither train nor test.
How many principal components do we need to clearly distinguish the data?
To answer the above question we will be using explained_variance_.
From the documentation:
The amount of variance explained by each of the selected components. Equal to n_components largest eigenvalues of the covariance matrix of X.
from matplotlib.pyplot import plot, xlabel, ylabel
pca2 = PCA().fit(x)
plot(pca2.explained_variance_, linewidth=2)
xlabel('Components')
ylabel('Explained Variances')
show()
From the above graph, we can see that the explained variance levels off after roughly 100 components, so about 100 components should be enough for PCA to distinguish the people.
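Reading the number off a plot can also be done numerically. The sketch below is one possible way, not the method used above: it accumulates explained_variance_ratio_ and picks the smallest number of components that exceeds a target fraction. The 0.95 threshold is a hypothetical choice, and load_digits (bundled with scikit-learn) stands in for the 78-variable dataset from the question.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Bundled dataset used as a stand-in for the real data (assumption).
X, _ = load_digits(return_X_y=True)

threshold = 0.95  # hypothetical target fraction of variance to keep

# Fit all components and accumulate the explained-variance ratios.
pca_full = PCA(svd_solver="full").fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds the threshold.
n_needed = int(np.argmax(cumulative > threshold)) + 1

# Shortcut: a float n_components in (0, 1) lets PCA choose the count itself.
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X)
print(n_needed, pca_auto.n_components_)
```

Both routes should report the same count. The right threshold depends on the downstream task, so it is worth checking model accuracy at the chosen value rather than sweeping every possible number of components.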
Simplified-code:
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
x, _ = fetch_olivetti_faces(return_X_y=True)
pca2 = PCA().fit(x)
plt.plot(pca2.explained_variance_, linewidth=2)
plt.xlabel('Components')
plt.ylabel('Explained Variances')
plt.show()
Today I'm working on a dataset from Kaggle: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data. I would like to segment my dataset by beds, baths, and neighborhood, and use DBSCAN to get a clustering by price in each segment. The problem is that each segment is different, so I don't want to use the same epsilon for the whole dataset; I want the best epsilon for each segment. Do you know an efficient way to do this?
from sklearn.cluster import DBSCAN
import sklearn.utils
from sklearn.preprocessing import StandardScaler
sklearn.utils.check_random_state(1000)
Clus_dataSet = pdf[['beds','baths','neighborhood','price']]
Clus_dataSet = np.nan_to_num(Clus_dataSet)
Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=6).fit(Clus_dataSet)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
pdf["Clus_Db"]=labels
realClusterNum=len(set(labels)) - (1 if -1 in labels else 0)
clusterNum = len(set(labels))
Thank you.
A heuristic for setting the Epsilon and MinPts parameters was proposed in the original DBSCAN paper.
Once the MinPts value is set (e.g. 2 ∗ number of features), the partitioning result strongly depends on Epsilon. The heuristic suggests inferring Epsilon through a visual analysis of the k-dist plot.
A toy example of the procedure with two Gaussian distributions is shown below.
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt
from sklearn.datasets import make_biclusters
data,lab,_ = make_biclusters((200,2), 2, noise=0.1, minval=0, maxval=1)
minpts = 4
nbrs = NearestNeighbors(n_neighbors=minpts, algorithm='ball_tree').fit(data)
distances, indices = nbrs.kneighbors(data)
k_dist = [x[-1] for x in distances]
f,ax = plt.subplots(1,2,figsize = (10,5))
ax[0].set_title('k-dist plot for k = minpts = 4')
ax[0].plot(sorted(k_dist))
ax[0].set_xlabel('object index after sorting by k-distance')
ax[0].set_ylabel('k-distance')
ax[1].set_title('original data')
ax[1].scatter(data[:,0],data[:,1],c = lab[0])
In the resulting k-dist plot, the "elbow" theoretically divides noise objects from cluster objects and indeed gives an indication on a plausible range of values for Epsilon (tailored on the dataset in combination with the selected value of MinPts). In this toy example, I would say between 0.05 and 0.075.
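Since the question asks for a different epsilon per segment, the visual step can be replaced by a rough automatic guess. The sketch below is an assumption on my part and not part of the DBSCAN paper: estimate_eps is a hypothetical helper that takes the elbow to be the point of the sorted k-dist curve lying farthest below the straight line joining its two ends. The two synthetic segments are made up for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def estimate_eps(points, min_pts=4):
    """Guess eps as the elbow of the sorted k-distance curve (hypothetical helper)."""
    # Querying the training points returns each point as its own nearest
    # neighbour (distance 0), exactly as in the snippet above.
    nbrs = NearestNeighbors(n_neighbors=min_pts).fit(points)
    distances, _ = nbrs.kneighbors(points)
    k_dist = np.sort(distances[:, -1])

    # Elbow heuristic (assumption): the point farthest below the straight
    # line joining the first and last values of the curve.
    chord = np.linspace(k_dist[0], k_dist[-1], len(k_dist))
    elbow = int(np.argmax(chord - k_dist))
    return float(k_dist[elbow])


# Two synthetic "segments" with the same shape but different densities (assumption).
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
segments = {"dense": 0.1 * base, "sparse": base}

eps_per_segment = {name: estimate_eps(pts) for name, pts in segments.items()}
print(eps_per_segment)
```

With the question's DataFrame, the same function could be called on the scaled price values of each group produced by a groupby on beds, baths and neighborhood, and the result passed as eps to a separate DBSCAN per group. It is still worth plotting a few of the k-dist curves to check that the guessed elbow looks sensible.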
I would like to cluster the following set of data in two clusters corresponding to each line ("\" and "/" ) of the "X". I was thinking that it could be done using the Pearson correlation coefficients as distance metric in Scikit-learn Agglomerative clustering as indicated here (How to use Pearson Correlation as distance metric in Scikit-learn Agglomerative clustering). But it doesn't seem to work.
Plot of the raw data
Data:
-6.5955882 11.344538
-6.1911765 12.027311
-5.4191176 10.346639
-4.7573529 7.5105042
-2.9191176 7.7205882
-1.5955882 6.6176471
-2.9558824 6.039916
-1.1544118 3.9915966
-0.088235294 4.7794118
-0.088235294 2.8361345
0.53676471 -1.2079832
2.7794118 0
3.4044118 -4.3592437
5.2794118 -3.9915966
6.75 -8.5609244
7.4485294 -6.8802521
5.1691176 -5.7247899
-7.1470588 -2.8361345
-6.7058824 -1.2605042
-4.4264706 -1.1554622
-3.5073529 0.78781513
-0.86029412 0.31512605
-1.0808824 2.1533613
-2.8823529 -0.42016807
1.0514706 2.2584034
1.9338235 4.4117647
4.6544118 5.5147059
3.7352941 7.0378151
6.0147059 8.2457983
7.0808824 7.7205882
The code I've tried:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.stats import pearsonr
nc=2
data = np.loadtxt("cross-data_2.dat")
plt.scatter(data[:,0], data[:,1], s=100, cmap='viridis')
def pearson_affinity(M):
    return 1 - np.array([[pearsonr(a,b)[0] for a in M] for b in M])
hc = AgglomerativeClustering(n_clusters=nc, affinity = pearson_affinity, linkage = 'average')
y_hc = hc.fit_predict(data)
plt.figure()
plt.scatter(data[y_hc ==0,0], data[y_hc == 0,1], s=100, c='red')
plt.scatter(data[y_hc==1,0], data[y_hc == 1,1], s=100, c='black')
plt.show()
The results of the clustering:
Is there something wrong in the code or should I simply use another method?
I propose yet another method for this, Gaussian Mixture Models.
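import numpy as np
import matplotlib.pyplot as plt
X = np.loadtxt("cross-data_2.dat")  # the data file from the question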
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=2,
init_params='random',
n_init=5,
random_state=123)
y_pred = gmm.fit_predict(X)
plt.scatter(*X.T, c=y_pred)
I can propose an alternative method to achieve this. Since you are trying to cluster points that lie along the same angle, we can first transform the data to polar (r-theta) coordinates and then use simple KMeans clustering.
# x is the (n_samples, 2) array of points from the question
r = np.sqrt(x[:, 0]**2 + x[:, 1]**2)
theta = np.arctan(x[:, 1]/x[:, 0])
xr = np.vstack((r*np.sin(theta), r*np.cos(theta))).T
from sklearn.cluster import KMeans
km = KMeans(2)
xx = km.fit_predict(xr)
plt.scatter(x[:, 0], x[:, 1], c=xx)
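Here is a self-contained sketch of the same idea on synthetic data, so it can be run without the data file. The points along y = x and y = -x, the noise level, and the names folded and true_line are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic "X" shape (assumption): noisy points on y = x and y = -x,
# leaving out the ambiguous region right at the crossing.
rng = np.random.default_rng(0)
s = np.linspace(1.0, 8.0, 15)
t = np.concatenate([-s, s])
slash = np.column_stack([t, t])        # the "/" line
backslash = np.column_stack([t, -t])   # the "\" line
x = np.vstack([slash, backslash]) + rng.normal(scale=0.2, size=(60, 2))
true_line = np.repeat([0, 1], 30)

# arctan folds the angle into (-pi/2, pi/2), so both ends of a line
# through the origin end up with the same theta.
r = np.hypot(x[:, 0], x[:, 1])
theta = np.arctan(x[:, 1] / x[:, 0])
folded = np.column_stack([r * np.sin(theta), r * np.cos(theta)])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(folded)

# Cluster ids are arbitrary, so compare up to a swap of the two labels.
agreement = (labels == true_line).mean()
print(max(agreement, 1 - agreement))
```

The folding only works when the two lines cross near the origin, so for other data it may be necessary to subtract the crossing point (or the mean) first.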
I'm trying to reduce my components to 2 instead of 64 but I keep getting this error:
"Length mismatch: Expected axis has 64 elements, new values have 4 elements"
Why is the PCA I'm running on the data set not changing the number to 2?
This is what I have:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
import sklearn.metrics as sm
import pandas as pd
import numpy as np
import scipy
from sklearn import decomposition
digits = datasets.load_digits() #load the digits dataset instead of the iris dataset
x = pd.DataFrame(digits.data) #was(iris.data)
x.columns = ['Sepal_L', 'Sepal_W', 'Sepal_L', 'Sepal_W']
plt.cla()
pca = decomposition.PCA(n_components=2)
pca.fit(x)
x = pca.transform(x)
y = pd.DataFrame(digits.target)
y.columns = ['Targets']
# this line actually builds the machine learning model and runs the algorithm
# on the dataset
model = KMeans(n_clusters = 10) #Run k-means on this datatset to cluster the data into 10 classes
model.fit(x)
#print(model.labels_)
colormap = np.array(['red', 'blue', 'yellow', 'black'])
# Plot the Models Classifications
plt.subplot(1, 2, 2)
plt.scatter(x.Petal_L, x.Petal_W, c=colormap[model.labels_], s=40)
plt.title('K Means Classification')
plt.show()
It's not actually the PCA that is problematic, but just the renaming of your columns: the digits dataset has 64 columns, and you are trying to name the columns according to the column names for the 4 columns in the iris dataset.
Because of the nature of the digits dataset (pixels), there isn't really an appropriate naming scheme for the columns. So just don't rename them.
digits = datasets.load_digits()
x = pd.DataFrame(digits.data)
pca = decomposition.PCA(n_components=2)
pca.fit(x)
x = pca.transform(x)
# Here is the result of your PCA (2 components)
>>> x
array([[ -1.25946636, 21.27488332],
[ 7.95761139, -20.76869904],
[ 6.99192268, -9.9559863 ],
...,
[ 10.80128366, -6.96025224],
[ -4.87210049, 12.42395326],
[ -0.34438966, 6.36554934]])
Then you can plot the first pc against the second, if that's what you're going for (what I gathered from your code)
plt.scatter(x[:,0], x[:,1], s=40)
plt.show()
Sci-Kit learn Kmeans and PCA dimensionality reduction
I have a dataset, 2M rows by 7 columns, with different measurements of home power consumption with a date for each measurement.
date,
Global_active_power,
Global_reactive_power,
Voltage,
Global_intensity,
Sub_metering_1,
Sub_metering_2,
Sub_metering_3
I put my dataset into a pandas dataframe, selecting all columns but the date column, then perform cross validation split.
import pandas as pd
from sklearn.cross_validation import train_test_split
data = pd.read_csv('household_power_consumption.txt', delimiter=';')
power_consumption = data.iloc[0:, 2:9].dropna()
pc_toarray = power_consumption.values
hpc_fit, hpc_fit1 = train_test_split(pc_toarray, train_size=.01)
power_consumption.head()
I use K-means classification followed by PCA dimensionality reduction to display.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
hpc = PCA(n_components=2).fit_transform(hpc_fit)
k_means = KMeans()
k_means.fit(hpc)
x_min, x_max = hpc[:, 0].min() - 5, hpc[:, 0].max() - 1
y_min, y_max = hpc[:, 1].min(), hpc[:, 1].max() + 5
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))
Z = k_means.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
extent=(xx.min(), xx.max(), yy.min(), yy.max()),
cmap=plt.cm.Paired,
aspect='auto', origin='lower')
plt.plot(hpc[:, 0], hpc[:, 1], 'k.', markersize=4)
centroids = k_means.cluster_centers_
inert = k_means.inertia_
plt.scatter(centroids[:, 0], centroids[:, 1],
marker='x', s=169, linewidths=3,
color='w', zorder=8)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()
Now I would like to find out which rows fell under a given class then which dates fell under a given class.
Is there any way to relate the points on the graph to an index in my dataset, after PCA?
Some method I don't know of?
Or is my approach fundamentally flawed?
Any recommendations?
I am fairly new to this field and am trying to read through lots of code; this is a compilation of several examples I've seen documented.
My goal is to classify the data and then get the dates that fall under a class.
Thank You
KMeans().predict(X), from the scikit-learn docs:
Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
Parameters: (New data to predict)
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Returns: (Index of the cluster each sample belongs to)
labels : array, shape [n_samples,]
The problem with the code you submitted is the use of
train_test_split()
which returns two arrays of randomly selected rows from your dataset, effectively destroying the row order and making it difficult to correlate the labels returned by KMeans with the sequential dates in your dataset.
Here's an example:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
#read data into pandas dataframe
df = pd.read_csv('household_power_consumption.txt', delimiter=';')
#merge date and time columns and convert to datetime objects
df['Datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
df.set_index(pd.DatetimeIndex(df['Datetime']), inplace=True)
df.drop(['Date','Time'], axis=1, inplace=True)
#put last column first
cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]
df = df.dropna()
#convert dataframe to data array, leaving out the date column, which should not be processed
sliced = df.iloc[0:, 1:8].dropna()
hpc = sliced.values
k_means = KMeans()
k_means.fit(hpc)
# array of indexes corresponding to classes around centroids, in the order of your dataset
classified_data = k_means.labels_
#copy dataframe (may be memory intensive but just for illustration)
df_processed = df.copy()
df_processed['Cluster Class'] = pd.Series(classified_data, index=df_processed.index)
Now you can see your result matched with your data-set on the right side.
Now that it's classified, it's up to you to derive meaning.
This is just a good overall example of how it can be used, from start to finish.
To display your result, look at PCA or make other graphs that depend on the class.
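To get the dates that fall under a given class, the label column can be grouped or filtered directly, because the DataFrame keeps its datetime index. A minimal sketch, assuming a tiny made-up frame in place of the real power-consumption data (the values and the name dates_by_cluster are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the power-consumption data (assumption).
index = pd.date_range("2007-01-01", periods=6, freq="D")
df = pd.DataFrame(
    {
        "Global_active_power": [0.2, 0.3, 4.1, 4.3, 0.25, 4.0],
        "Voltage": [240.0, 241.0, 232.0, 231.0, 240.5, 233.0],
    },
    index=index,
)

# Cluster the rows in their original order so labels line up with the index.
k_means = KMeans(n_clusters=2, n_init=10, random_state=0).fit(df.values)
df["Cluster Class"] = k_means.labels_

# Dates per cluster: group the datetime index by label.
dates_by_cluster = {
    label: group.index for label, group in df.groupby("Cluster Class")
}
for label, dates in dates_by_cluster.items():
    print(label, list(dates.strftime("%Y-%m-%d")))
```

The same pattern works with the df_processed frame built above: df_processed[df_processed['Cluster Class'] == k].index gives the timestamps assigned to cluster k.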