I'm performing PCA preprocessing on a dataset of 78 variables. How would I calculate the optimal value of PCA variables?
My first thought was to start at, for example, 5 and working my way up and calculating accuracy . However, for obvious reasons this wasn't a time effective means of calculating.
Does anyone have any suggestions/experience? Or even a methodology for calculating the optimal value?
First look at the dataset distribution and then used explained_variance_ to find the number of components.
Start with projecting your samples on a 2-D graph.
Assume I have a face dataset (Olivetti-faces) 40 people and each person has 10 samples. Overall 400 images. We will split 280 trains and 120 test samples.
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split
olivetti = fetch_olivetti_faces()
x = olivetti.images # Train
y = olivetti.target # Labels
x_train, x_test, y_train, y_test = train_test_split(x, y,
test_size=0.3,
random_state=42)
x_train = x_train.reshape((x_train.shape[0], x.shape[1] * x.shape[2]))
x_test = x_test.reshape((x_test.shape[0], x.shape[1] * x.shape[2]))
x = x.reshape((x.shape[0]), x.shape[1] * x.shape[2])
Now we want to see how pixels are distributed. To understand clearly, we will display the pixels in a 2-D graph.
from sklearn.decomposition import PCA
from matplotlib.pyplot import figure, get_cmap, colorbar, show
class_num = 40
sample_num = 10
pca = PCA(n_components=2).fit_transform(x)
idx_range = class_num * sample_num
fig = figure(figsize=(6, 3), dpi=300)
ax = fig.add_subplot(1, 1, 1)
c_map = get_cmap(name='jet', lut=class_num)
scatter = ax.scatter(pca[:idx_range, 0], pca[:idx_range, 1],
c=y[:idx_range],s=10, cmap=c_map)
ax.set_xlabel("First Principal Component")
ax.set_ylabel("Second Principal Component")
ax.set_title("PCA projection of {} people".format(class_num))
colorbar(mappable=scatter)
show()
We can say 40 people, each with 10 samples are not distinguishable with only 2 principal components.
Please remember we created this graph from the main dataset, neither train nor test.
How are many principal components we need to clearly distinguish the data?
To answer the above question we will be using explained_variance_.
From the documentation:
The amount of variance explained by each of the selected components. Equal to n_components largest eigenvalues of the covariance matrix of X.
from matplotlib.pyplot import plot, xlabel, ylabel
pca2 = PCA().fit(x)
plot(pca2.explained_variance_, linewidth=2)
xlabel('Components')
ylabel('Explained Variaces')
show()
From the above graph, we can see after 100 components PCA distinguishes the people.
Simplified-code:
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
x, _ = fetch_olivetti_faces(return_X_y=True)
pca2 = PCA().fit(x)
plt.plot(pca2.explained_variance_, linewidth=2)
plt.xlabel('Components')
plt.ylabel('Explained Variances')
plt.show()
Related
I have a KMeans function I made takes the input def kmeans(x,k, no_of_iterations): and returns the following return points, centroids it gets plotted perfectly, the code for that isn't very relevant. But I want to calculate for it, the silhouette score and graph this for each value.
#Load Data
data = load_digits().data
pca = PCA(2)
#Transform the data
df = pca.fit_transform(data)
X= df
#y = kmeans.fit_predict(X)
#Applying our function
label, centroids = kmeans(df,10,1000)#returns points value and centroids
y = label.fit_predict(data)
#Visualize the results
u_labels = np.unique(label)
for i in u_labels:
plt.scatter(df[label == i , 0] , df[label == i , 1] , label = i)
plt.scatter(centroids[:,0] , centroids[:,1] , s = 80, color = 'k')
plt.legend()
plt.show()
the above is code for running the KMeans plot
Below is my attempt to calculate silhouette. This is from an example that imports from KMeans but I don't really want to do that nor did it work with my code.
silhouette_avg = silhouette_score(X, y)
print("The average silhouette_score is :", silhouette_avg)
# Compute the silhouette scores for each sample
sample_silhouette_values = silhouette_samples(X, y)
You may notice that there is no value here for y, as I have found y is supposed to be the amount of clusters I think? So I had it as 10 at first and it give an error message. I don't know if from this code anyone could tell me what I do next to get this value?
Try this:
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
mpl.rcParams["figure.figsize"] = (9,6)
# Generate synthetic dataset with 8 blobs
X, y = make_blobs(n_samples=1000, n_features=12, centers=8, shuffle=True, random_state=42)
# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12))
visualizer.fit(X) # Fit the data to the visualizer
visualizer.poof()
# Instantiate the clustering model and visualizer
model = KMeans(8)
visualizer = SilhouetteVisualizer(model)
visualizer.fit(X) # Fit the data to the visualizer
visualizer.poof() # Draw/show/poof the data
Also, see this.
https://www.scikit-yb.org/en/latest/api/cluster/silhouette.html
Today I'm working on a dataset from Kaggle https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data. I would like to segment my dataset by beds, baths, neighborhood and use a DBSCAN to get a clustering by price in each segment. The problem is because each segment is different, I don't want to use the same epsilon for all my dataset but for each segment the best epsilon, do you know an efficient way to do it ?
from sklearn.cluster import DBSCAN
import sklearn.utils
from sklearn.preprocessing import StandardScaler
sklearn.utils.check_random_state(1000)
Clus_dataSet = pdf[['beds','baths','neighborhood','price']]
Clus_dataSet = np.nan_to_num(Clus_dataSet)
Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=6).fit(Clus_dataSet)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
pdf["Clus_Db"]=labels
realClusterNum=len(set(labels)) - (1 if -1 in labels else 0)
clusterNum = len(set(labels))
Thank you.
A heuristic for the setting of Epsilon and MinPts parameters has been proposed in the original DBSCAN paper
Once the MinPts value is set (e.g. 2 ∗ Number of features) the partitioning result strongly depends on Epsilon. The heuristic suggests to infer epsilon through a visual analysis of the k-dist plot.
A toy example of the procedure with two gaussian distributions is reported in the following.
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt
from sklearn.datasets import make_biclusters
data,lab,_ = make_biclusters((200,2), 2, noise=0.1, minval=0, maxval=1)
minpts = 4
nbrs = NearestNeighbors(n_neighbors=minpts, algorithm='ball_tree').fit(data)
distances, indices = nbrs.kneighbors(data)
k_dist = [x[-1] for x in distances]
f,ax = plt.subplots(1,2,figsize = (10,5))
ax[0].set_title('k-dist plot for k = minpts = 4')
ax[0].plot(sorted(k_dist))
ax[0].set_xlabel('object index after sorting by k-distance')
ax[0].set_ylabel('k-distance')
ax[1].set_title('original data')
ax[1].scatter(data[:,0],data[:,1],c = lab[0])
In the resulting k-dist plot, the "elbow" theoretically divides noise objects from cluster objects and indeed gives an indication on a plausible range of values for Epsilon (tailored on the dataset in combination with the selected value of MinPts). In this toy example, I would say between 0.05 and 0.075.
I'm trying to reduce my components to 2 instead of 64 but I keep getting this error:
"Length mismatch: Expected axis has 64 elements, new values have 4 elements"
Why is the PCA I'm running on the data set not changing the number to 2?
This is what I have:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
import sklearn.metrics as sm
import pandas as pd
import numpy as np
import scipy
from sklearn import decomposition
digits = datasets.load_digits() #load the digits dataset instead of the iris dataset
x = pd.DataFrame(digits.data) #was(iris.data)
x.columns = ['Sepal_L', 'Sepal_W', 'Sepal_L', 'Sepal_W']
plt.cla()
pca = decomposition.PCA(n_components=2)
pca.fit(x)
x = pca.transform(x)
y = pd.DataFrame(digits.target)
y.columns = ['Targets']
# this line actually builds the machine learning model and runs the algorithm
# on the dataset
model = KMeans(n_clusters = 10) #Run k-means on this datatset to cluster the data into 10 classes
model.fit(x)
#print(model.labels_)
colormap = np.array(['red', 'blue', 'yellow', 'black'])
# Plot the Models Classifications
plt.subplot(1, 2, 2)
plt.scatter(x.Petal_L, x.Petal_W, c=colormap[model.labels_], s=40)
plt.title('K Means Classification')
plt.show()
It's not actually the PCA that is problematic, but just the renaming of your columns: the digits dataset has 64 columns, and you are trying to name the columns according to the column names for the 4 columns in the iris dataset.
Because of the nature of the digits dataset (pixels), there isn't really an appropriate naming scheme for the columns. So just don't rename them.
digits = datasets.load_digits()
x = pd.DataFrame(digits.data)
pca = decomposition.PCA(n_components=2)
pca.fit(x)
x = pca.transform(x)
# Here is the result of your PCA (2 components)
>>> x
array([[ -1.25946636, 21.27488332],
[ 7.95761139, -20.76869904],
[ 6.99192268, -9.9559863 ],
...,
[ 10.80128366, -6.96025224],
[ -4.87210049, 12.42395326],
[ -0.34438966, 6.36554934]])
Then you can plot the first pc against the second, if that's what you're going for (what I gathered from your code)
plt.scatter(x[:,0], x[:,1], s=40)
plt.show()
I am playing with Non-Negative Matrix Factorisation (NMF) algorithm on MNIST database and I want to visualise how the first 15 components change when I reduce total number of components.
First I download the MNIST Database and transform it to numPy arrays
import numpy as np
from mnist import MNIST
mndata = MNIST('./data')
images_train, labels_train = mndata.load_training()
images_test, labels_test = mndata.load_testing()
labels_train = labels_train.tolist()
labels_test = labels_test.tolist()
X_train = np.array(images_train).astype('float64')
y_train = np.array(labels_train)
X_test = np.array(images_test).astype('float64')
y_test = np.array(labels_test)
Then I build the NMF model with 100 components
from sklearn.decomposition import NMF
nmf = NMF(n_components=100, random_state=0)
nmf.fit(X_train)
and I display first 15 components
import matplotlib.pyplot as plt
fig, axes = plt.subplots(3, 5, figsize=(12, 7), subplot_kw={'xticks': (), 'yticks': ()})
for i, (component, ax) in enumerate(zip(nmf.components_, axes.ravel())):
ax.imshow(component.reshape(28,28), cmap=plt.cm.binary)
ax.set_title("{}. component".format(i + 1))
Output:
Then I repeat the same process for 50 components
Output:
and lastly for 15 components
Output:
I wonder if the first 15 components of these NMF models changes smoothly when I reduce the number of total components. I would like to plot whole bunch of the pictures like the ones above and display them in an animation. More precisely I would like to animate them when number of total components goes from 100 to 15 with the step equal to 5. How can I do this?
After doing PCA on my data and plotting the kmeans clusters, my plot looks really weird. The centers of the clusters and scatter plot of the points do not make sense to me. Here is my code:
#clicks, conversion, bounce and search are lists of values.
clicks=[2,0,0,8,7,...]
conversion = [1,0,0,6,0...]
bounce = [2,4,5,0,1....]
X = np.array([clicks,conversion, bounce]).T
y = np.array(search)
num_clusters = 5
pca=PCA(n_components=2, whiten=True)
data2D = pca.fit_transform(X)
print data2D
>>> [[-0.07187948 -0.17784291]
[-0.07173769 -0.26868727]
[-0.07173789 -0.26867958]
...,
[-0.06942414 -0.25040886]
[-0.06950897 -0.19591147]
[-0.07172973 -0.2687937 ]]
km = KMeans(n_clusters=num_clusters, init='k-means++',n_init=10, verbose=1)
km.fit_transform(X)
labels=km.labels_
centers2D = pca.fit_transform(km.cluster_centers_)
colors=['#000000','#FFFFFF','#FF0000','#00FF00','#0000FF']
col_map=dict(zip(set(labels),colors))
label_color = [col_map[l] for l in labels]
plt.scatter( data2D[:,0], data2D[:,1], c=label_color)
plt.hold(True)
plt.scatter(centers2D[:,0], centers2D[:,1], marker='x', c='r')
plt.show()
The red crosses are the center of the clusters. Any help would be great.
Your ordering of PCA and KMeans is screwing things up...
Here is what you need to do:
Normalize your data.
Perform PCA on X to reduce the dimensions from 5 to 2 and produce Data2D
Normalize again
Cluster Data2D with KMeans
Plot the Centroids on top of Data2D.
Where as, here is what you have done above:
Perform PCA on X to reduce the dimensions from 5 to 2 to produce Data2D
Cluster the original data, X, in 5 dimensions.
Perform a separate PCA on your cluster centroids, which produces a completely different 2D subspace for the centroids.
Plot the PCA reduced Data2D with the PCA reduced centroids on top even though these no longer are coupled properly.
Normalization:
Take a look at the code below and you'll see that it puts the centroids right where they need to be. The normalization is key and is completely reversible. ALWAYS normalize your data when you cluster as the distance metrics need to move through all of the spaces equally. Clustering is one of the most important times to normalize your data, but in general... ALWAYS NORMALIZE :-)
A heuristic discussion that goes beyond your original question:
The entire point of dimensionality reduction is to make the KMeans clustering easier and to project out dimensions which don't add to the variance of the data. So you should pass the reduced data to your clustering algorithm. I'll add that there are very few 5D datasets which can be projected down to 2D without throwing out a lot of variance i.e. look at the PCA diagnostics to see whether 90% of the original variance has been preserved. If not, then you might not want to be so aggressive in your PCA.
New Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import seaborn as sns
%matplotlib inline
# read your data, replace 'stackoverflow.csv' with your file path
df = pd.read_csv('/Users/angus/Desktop/Downloads/stackoverflow.csv', usecols[0, 2, 4],names=['freq', 'visit_length', 'conversion_cnt'],header=0).dropna()
df.describe()
#Normalize the data
df_norm = (df - df.mean()) / (df.max() - df.min())
num_clusters = 5
pca=PCA(n_components=2)
UnNormdata2D = pca.fit_transform(df_norm)
# Check the resulting varience
var = pca.explained_variance_ratio_
print "Varience after PCA: ",var
#Normalize again following PCA: data2D
data2D = (UnNormdata2D - UnNormdata2D.mean()) / (UnNormdata2D.max()-UnNormdata2D.min())
print "Data2D: "
print data2D
km = KMeans(n_clusters=num_clusters, init='k-means++',n_init=10, verbose=1)
km.fit_transform(data2D)
labels=km.labels_
centers2D = km.cluster_centers_
colors=['#000000','#FFFFFF','#FF0000','#00FF00','#0000FF']
col_map=dict(zip(set(labels),colors))
label_color = [col_map[l] for l in labels]
plt.scatter( data2D[:,0], data2D[:,1], c=label_color)
plt.hold(True)
plt.scatter(centers2D[:,0], centers2D[:,1],marker='x',s=150.0,color='purple')
plt.show()
Plot:
Output:
Varience after PCA: [ 0.65725709 0.29875307]
Data2D:
[[-0.00338421 -0.0009403 ]
[-0.00512081 -0.00095038]
[-0.00512081 -0.00095038]
...,
[-0.00477349 -0.00094836]
[-0.00373153 -0.00094232]
[-0.00512081 -0.00095038]]
Initialization complete
Iteration 0, inertia 51.225
Iteration 1, inertia 38.597
Iteration 2, inertia 36.837
...
...
Converged at iteration 31
Hope this helps!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
# read your data, replace 'stackoverflow.csv' with your file path
df = pd.read_csv('stackoverflow.csv', usecols=[0, 2, 4], names=['freq', 'visit_length', 'conversion_cnt'], header=0).dropna()
df.describe()
Out[3]:
freq visit_length conversion_cnt
count 289705.0000 289705.0000 289705.0000
mean 0.2624 20.7598 0.0748
std 0.4399 55.0571 0.2631
min 0.0000 1.0000 0.0000
25% 0.0000 6.0000 0.0000
50% 0.0000 10.0000 0.0000
75% 1.0000 21.0000 0.0000
max 1.0000 2500.0000 1.0000
# binarlize freq and conversion_cnt
df.freq = np.where(df.freq > 1.0, 1, 0)
df.conversion_cnt = np.where(df.conversion_cnt > 0.0, 1, 0)
feature_names = df.columns
X_raw = df.values
transformer = PCA(n_components=2)
X_2d = transformer.fit_transform(X_raw)
# over 99.9% variance captured by 2d data
transformer.explained_variance_ratio_
Out[4]: array([ 9.9991e-01, 6.6411e-05])
# do clustering
estimator = KMeans(n_clusters=5, init='k-means++', n_init=10, verbose=1)
estimator.fit(X_2d)
labels = estimator.labels_
colors = ['#000000','#FFFFFF','#FF0000','#00FF00','#0000FF']
col_map=dict(zip(set(labels),colors))
label_color = [col_map[l] for l in labels]
fig, ax = plt.subplots()
ax.scatter(X_2d[:,0], X_2d[:,1], c=label_color)
ax.scatter(estimator.cluster_centers_[:,0], estimator.cluster_centers_[:,1], marker='x', s=50, c='r')
KMeans tries to minimize within-group Euclidean distance, and this may or may not be appropriate for your data. Just based on the graph, I would consider a Gaussian Mixture Model to do the unsupervised clustering.
Also, if you have superior knowledge on which observations might be classified into which category/label, you can do a semi-supervised learning.