Today I'm working on a dataset from Kaggle: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data. I would like to segment my dataset by beds, baths, and neighborhood, and use DBSCAN to cluster by price within each segment. The problem is that because each segment is different, I don't want to use the same epsilon for the whole dataset; I want the best epsilon for each segment. Do you know an efficient way to do it?
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import sklearn.utils
import numpy as np
sklearn.utils.check_random_state(1000)
# select the segmentation columns plus price, then clean and scale
Clus_dataSet = pdf[['beds', 'baths', 'neighborhood', 'price']]
Clus_dataSet = np.nan_to_num(Clus_dataSet)
Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=6).fit(Clus_dataSet)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
pdf["Clus_Db"] = labels
# number of clusters, ignoring noise (label -1) if present
realClusterNum = len(set(labels)) - (1 if -1 in labels else 0)
clusterNum = len(set(labels))
Thank you.
A heuristic for setting the Epsilon and MinPts parameters was proposed in the original DBSCAN paper.
Once the MinPts value is set (e.g. 2 * number of features), the partitioning result strongly depends on Epsilon. The heuristic suggests inferring epsilon through a visual analysis of the k-dist plot.
A toy example of the procedure with two Gaussian distributions is shown below.
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt
from sklearn.datasets import make_biclusters
data,lab,_ = make_biclusters((200,2), 2, noise=0.1, minval=0, maxval=1)
minpts = 4
nbrs = NearestNeighbors(n_neighbors=minpts, algorithm='ball_tree').fit(data)
distances, indices = nbrs.kneighbors(data)
k_dist = [x[-1] for x in distances]
f,ax = plt.subplots(1,2,figsize = (10,5))
ax[0].set_title('k-dist plot for k = minpts = 4')
ax[0].plot(sorted(k_dist))
ax[0].set_xlabel('object index after sorting by k-distance')
ax[0].set_ylabel('k-distance')
ax[1].set_title('original data')
ax[1].scatter(data[:,0],data[:,1],c = lab[0])
In the resulting k-dist plot, the "elbow" theoretically separates noise objects from cluster objects and gives an indication of a plausible range of values for Epsilon (tailored to the dataset in combination with the selected value of MinPts). In this toy example, I would say between 0.05 and 0.075.
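To automate this per segment rather than eyeballing a single global k-dist plot, here is a rough sketch of the idea (untested on the Kaggle data; pdf and the column names come from the question, and the elbow is approximated by the largest jump in the sorted k-distances, which is only a crude stand-in for visual inspection):
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

minpts = 6

def estimate_eps(values, k=minpts):
    # k-dist curve for this segment: distance of every point to its
    # k-th nearest neighbour (self included), sorted ascending
    nbrs = NearestNeighbors(n_neighbors=k).fit(values)
    k_dist = np.sort(nbrs.kneighbors(values)[0][:, -1])
    # crude elbow: the k-distance just before the largest jump in the curve
    return k_dist[np.argmax(np.diff(k_dist))]

labels = pd.Series(-1, index=pdf.index)
for _, seg in pdf.dropna(subset=['price']).groupby(['beds', 'baths', 'neighborhood']):
    if len(seg) <= minpts:          # too few listings to cluster
        continue
    prices = seg[['price']].to_numpy()
    eps = estimate_eps(prices)
    if eps <= 0:                    # e.g. all prices identical in the segment
        continue
    labels.loc[seg.index] = DBSCAN(eps=eps, min_samples=minpts).fit_predict(prices)

pdf['Clus_Db'] = labels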
Based on the guide Implementing PCA in Python by Sebastian Raschka, I am building the PCA algorithm from scratch for my research purposes. The class definition is:
import numpy as np

class PCA(object):
    """Dimension Reduction using Principal Component Analysis (PCA)

    It is the process of computing the principal components which explain the
    maximum variation of the dataset using fewer components.

    :type n_components: int, optional
    :param n_components: Number of components to consider; if not set then
                         `n_components = min(n_samples, n_features)`, where
                         `n_samples` is the number of samples, and
                         `n_features` is the number of features (i.e.,
                         the dimension of the dataset).

    Attributes
    ==========
    :type covariance_: np.ndarray
    :param covariance_: Covariance Matrix

    :type eig_vals_: np.ndarray
    :param eig_vals_: Calculated Eigenvalues

    :type eig_vecs_: np.ndarray
    :param eig_vecs_: Calculated Eigenvectors

    :type explained_variance_: np.ndarray
    :param explained_variance_: Explained Variance of Each Principal Component

    :type cum_explained_variance_: np.ndarray
    :param cum_explained_variance_: Cumulative Explained Variance
    """

    def __init__(self, n_components : int = None):
        """Default Constructor for Initialization"""
        self.n_components = n_components

    def fit_transform(self, X : np.ndarray):
        """Fit the PCA algorithm to the dataset and return the projected data"""
        if not self.n_components:
            self.n_components = min(X.shape)

        # covariance matrix of the features
        self.covariance_ = np.cov(X.T)

        # calculate eigenvalues and eigenvectors of the covariance matrix
        self.eig_vals_, self.eig_vecs_ = np.linalg.eig(self.covariance_)

        # explained variance (in percent), sorted in descending order
        _tot_eig_vals = sum(self.eig_vals_)
        self.explained_variance_ = np.array([(i / _tot_eig_vals) * 100 for i in sorted(self.eig_vals_, reverse = True)])
        self.cum_explained_variance_ = np.cumsum(self.explained_variance_)

        # define `W` as the `d x k`-dimensional projection matrix
        self.W_ = self.eig_vecs_[:, :self.n_components]
        print(X.shape, self.W_.shape)

        return X.dot(self.W_)
Considering the iris dataset as a test case, PCA is performed and visualized as follows:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# loading iris data, and normalize
from sklearn.datasets import load_iris
iris = load_iris()
from sklearn.preprocessing import MinMaxScaler
X, y = iris.data, iris.target
X = MinMaxScaler().fit_transform(X)
# using the PCA function (defined above)
# to fit_transform the X value
# naming the PCA object as dPCA (d = defined)
dPCA = PCA()
principalComponents = dPCA.fit_transform(X)
# creating a pandas dataframe for the principal components
# and visualize the data using scatter plot
PCAResult = pd.DataFrame(principalComponents, columns = [f"PCA-{i}" for i in range(1, dPCA.n_components + 1)])
PCAResult["target"] = y # possible as original order does not change
sns.scatterplot(x = "PCA-1", y = "PCA-2", data = PCAResult, hue = "target", s = 50)
plt.show()
The output is as follows:
Now, I wanted to verify the output, for which I used the sklearn library, and the output is as follows:
from sklearn.decomposition import PCA # note the same name
sPCA = PCA() # consider all the components
principalComponents_ = sPCA.fit_transform(X)
PCAResult_ = pd.DataFrame(principalComponents_, columns = [f"PCA-{i}" for i in range(1, 5)])
PCAResult_["target"] = y # possible as original order does not change
sns.scatterplot(x = "PCA-1", y = "PCA-2", data = PCAResult_, hue = "target", s = 50)
plt.show()
I don't understand why the output is oriented differently and the values differ slightly. I studied numerous codes [1, 2, 3], all of which have the same issue. My questions:
What is different in sklearn, such that the plot is different? I've tried with a different dataset too - the same problem.
Is there a way to fix this issue?
I was not able to study the sklearn.decomposition.PCA algorithm, as I am new to the OOP concepts in Python.
The output in the blog post by Sebastian Raschka also has a minor variation. Figure below:
When calculating an eigenvector you may change its sign and the solution will also be a valid one.
So any PCA axis can be reversed and the solution will be valid.
Nevertheless, you may wish to impose a positive correlation of a PCA axis with one of the original variables in the dataset, inverting the axis if needed.
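As an illustration of that convention, here is a minimal sketch (a hypothetical helper of my own, not part of sklearn) that flips each principal axis so it correlates positively with the original variable it loads on most strongly:
import numpy as np

def fix_signs(W, X_centered):
    """Hypothetical helper: enforce a positive correlation between each
    principal axis and the original feature it loads on most strongly."""
    scores = X_centered @ W                      # projected data
    for j in range(W.shape[1]):
        k = np.argmax(np.abs(W[:, j]))           # dominant original variable
        if np.corrcoef(scores[:, j], X_centered[:, k])[0, 1] < 0:
            W[:, j] *= -1                        # invert the axis if needed
    return W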
The difference in values comes from sklearn's PCA using SVD. In sklearn there is a function, svd_flip, used to flip the signs of the PCs, which explains why you see this flip.
More details on the help page:
It uses the LAPACK implementation of the full SVD or a randomized
truncated SVD by the method of Halko et al. 2009, depending on the
shape of the input data and the number of components to extract.
You can read about the relation here
We first run your example dataset:
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.utils.extmath import svd_flip
import pandas as pd
import numpy as np
import scipy
iris = load_iris()
X, y = iris.data, iris.target
X = MinMaxScaler().fit_transform(X)
n_components = 4
sPCA = PCA(n_components,svd_solver="full")
sklearnPCs = pd.DataFrame(sPCA.fit_transform(X))
We now perform SVD on your centered matrix:
U,S,Vt = scipy.linalg.svd(X - X.mean(axis=0))
U = U[:,:n_components]
U, Vt = svd_flip(U, Vt)
svdPCs = pd.DataFrame(U*S)
The results:
sklearnPCs
0 1 2 3
0 -0.630703 0.107578 -0.018719 -0.007307
1 -0.622905 -0.104260 -0.049142 -0.032359
2 -0.669520 -0.051417 0.019644 -0.007434
3 -0.654153 -0.102885 0.023219 0.020114
4 -0.648788 0.133488 0.015116 0.011786
.. ... ... ... ...
145 0.551462 0.059841 0.086283 -0.110092
146 0.407146 -0.171821 -0.004102 -0.065241
147 0.447143 0.037560 0.049546 -0.032743
148 0.488208 0.149678 0.239209 0.002864
149 0.312066 -0.031130 0.118672 0.052505
svdPCs
0 1 2 3
0 -0.630703 0.107578 -0.018719 -0.007307
1 -0.622905 -0.104260 -0.049142 -0.032359
2 -0.669520 -0.051417 0.019644 -0.007434
3 -0.654153 -0.102885 0.023219 0.020114
4 -0.648788 0.133488 0.015116 0.011786
.. ... ... ... ...
145 0.551462 0.059841 0.086283 -0.110092
146 0.407146 -0.171821 -0.004102 -0.065241
147 0.447143 0.037560 0.049546 -0.032743
148 0.488208 0.149678 0.239209 0.002864
149 0.312066 -0.031130 0.118672 0.052505
You can implement it without the flip. The values will be the same, and your PCA will be valid, as noted in the other answer.
I'm performing PCA preprocessing on a dataset of 78 variables. How would I calculate the optimal number of PCA components?
My first thought was to start at, for example, 5 components and work my way up, calculating the accuracy each time. However, for obvious reasons this wasn't a time-effective way of doing it.
Does anyone have any suggestions or experience? Or even a methodology for calculating the optimal value?
First look at the dataset distribution and then use explained_variance_ to find the number of components.
Start by projecting your samples onto a 2-D graph.
Assume I have a face dataset (Olivetti faces): 40 people, each with 10 samples, 400 images overall. We will split them into 280 training and 120 test samples.
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split
olivetti = fetch_olivetti_faces()
x = olivetti.images # Train
y = olivetti.target # Labels
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size=0.3,
                                                    random_state=42)
x_train = x_train.reshape((x_train.shape[0], x.shape[1] * x.shape[2]))
x_test = x_test.reshape((x_test.shape[0], x.shape[1] * x.shape[2]))
x = x.reshape((x.shape[0]), x.shape[1] * x.shape[2])
Now we want to see how the samples are distributed. To see this clearly, we will display them in a 2-D graph.
from sklearn.decomposition import PCA
from matplotlib.pyplot import figure, get_cmap, colorbar, show
class_num = 40
sample_num = 10
pca = PCA(n_components=2).fit_transform(x)
idx_range = class_num * sample_num
fig = figure(figsize=(6, 3), dpi=300)
ax = fig.add_subplot(1, 1, 1)
c_map = get_cmap(name='jet', lut=class_num)
scatter = ax.scatter(pca[:idx_range, 0], pca[:idx_range, 1],
                     c=y[:idx_range], s=10, cmap=c_map)
ax.set_xlabel("First Principal Component")
ax.set_ylabel("Second Principal Component")
ax.set_title("PCA projection of {} people".format(class_num))
colorbar(mappable=scatter)
show()
We can see that 40 people, each with 10 samples, are not distinguishable with only 2 principal components.
Please remember we created this graph from the whole dataset, neither the train nor the test split.
How many principal components do we need to clearly distinguish the data?
To answer the above question we will be using explained_variance_.
From the documentation:
The amount of variance explained by each of the selected components. Equal to n_components largest eigenvalues of the covariance matrix of X.
from matplotlib.pyplot import plot, xlabel, ylabel, show
pca2 = PCA().fit(x)
plot(pca2.explained_variance_, linewidth=2)
xlabel('Components')
ylabel('Explained Variances')
show()
From the above graph, we can see that after about 100 components PCA distinguishes the people.
Simplified code:
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
x, _ = fetch_olivetti_faces(return_X_y=True)
pca2 = PCA().fit(x)
plt.plot(pca2.explained_variance_, linewidth=2)
plt.xlabel('Components')
plt.ylabel('Explained Variances')
plt.show()
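If you prefer a numeric rule to reading the curve, sklearn's PCA also accepts a fractional n_components and keeps just enough components to reach that share of the explained variance. A short sketch on the same Olivetti data (the 0.95 threshold is only an example):
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA
import numpy as np

x, _ = fetch_olivetti_faces(return_X_y=True)

# keep just enough components to explain 95% of the variance
pca95 = PCA(n_components=0.95, svd_solver='full').fit(x)
print(pca95.n_components_)

# the same number, derived by hand from the cumulative explained variance ratio
cum = np.cumsum(PCA().fit(x).explained_variance_ratio_)
print(np.argmax(cum >= 0.95) + 1)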
I've noticed that when using sklearn.cluster.KMeans to obtain clusters, the cluster centers from the attribute .cluster_centers_ and the means computed manually for each cluster don't seem to give exactly the same answer.
For small sample sizes the difference is very small and probably within float imprecision. But for larger samples:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
np.random.seed(0)
x = np.random.normal(size=5000)
x_z = (x - x.mean() / x.std()).reshape(5000,1)
cluster=KMeans(n_clusters=2).fit(x_z)
df = pd.DataFrame(x_z)
df['label'] = cluster.labels_
difference = np.abs(df.groupby('label').mean() - cluster.cluster_centers_)
print(difference)
[[ 0.00217333]
[ 0.00223798]]
Doing the same thing for different sample sizes:
This seems too much to be floating point imprecision. Are cluster centers not means, or what's going on here?
I think it may be related to the tolerance of KMeans. The default value is 1e-4, so setting a lower value, e.g. tol=1e-8, gives:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
np.random.seed(0)
x = np.random.normal(size=5000)
x_z = (x - x.mean() / x.std()).reshape(5000,1)
cluster=KMeans(n_clusters=2, tol=1e-8).fit(x_z)
df = pd.DataFrame(x_z)
df['label'] = cluster.labels_
difference = np.abs(df.groupby('label').mean() - cluster.cluster_centers_)
print(difference)
0
label
0 9.99200722e-16
1 1.11022302e-16
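A small sketch to make the mechanics explicit (reusing x_z and cluster from the snippet above): recomputing each center as the mean of its assigned points reproduces the group-by means exactly, i.e. they are one Lloyd update ahead of cluster_centers_, and the remaining gap is a center shift of roughly the size that tol (scaled internally by the data variance) allows.
import numpy as np

# recompute each center as the mean of the points assigned to it;
# this reproduces the group-by means and differs from cluster_centers_
# by the last (tol-sized) center shift
recomputed = np.array([x_z[cluster.labels_ == k].mean(axis=0)
                       for k in range(cluster.n_clusters)])
print(np.abs(recomputed - cluster.cluster_centers_).max())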
Hope it helps.
I have a dataset similar to the one shown below, which from my point of view clearly forms lines. Instead of drawing markers, I want to connect the markers within each curve by a line. I am curious what type of clustering algorithm would be a good choice in this case.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
#Generate (x,y) data
x = np.linspace(0.1,0.9,50)
y = x%1
x += np.sin(2*x%1)
y = y%0.2
#Shuffle (x,y) data
ns = list(range(len(x)))
np.random.shuffle(ns)
x = x[ns]
y = y[ns]
#Plot
fig, axs = plt.subplots(1,2)
axs[0].scatter(x,y)
axs[1].plot(x,y)
plt.savefig("markers vs lines.pdf")
Figure - Left: Markers, Right: Data points connected by lines.
Since you asked for a clustering algorithm, you may want to look at DBSCAN.
http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html
There are two parameters: epsilon and the minimum number of points needed to form a cluster.
Here is a code to get you started:
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
%matplotlib inline
#Generate (x,y) data
x = np.linspace(0.1,0.9,50)
y = x%1
x += np.sin(2*x%1)
y = y%0.2
#Shuffle (x,y) data
ns = list(range(len(x)))
np.random.shuffle(ns)
x = x[ns]
y = y[ns]
"""
Fit the Data
"""
X = [i for i in zip(x,y)]
X = StandardScaler().fit_transform(X)
"""
Compute the DBSCAN
"""
db = DBSCAN(eps=0.5, min_samples=1).fit(X)
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_clusters_
"""
Plot the clusters
"""
d= dict(zip(set(labels),['red','green','blue','yellow','purple','grey']))
d[-1] = "black"
plt.scatter(x,y,color=[ d[i] for i in labels])
plt.show()
The result:
Inspired by: http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html
More about the parameters of DBSCAN here: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN
Hope this helps.
Such data is common in image analysis, for example due to architecture in the scene.
In order to infer perspective, people have used the Hough transform to identify lines of points.
That is probably the best method to use here.
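If you want to try that route, here is a rough sketch (assuming scikit-image is installed and reusing the x, y arrays from the question): rasterize the scattered points onto a small binary image, since the classical Hough transform works on images, and then read the strongest peaks in (angle, distance) space as candidate lines.
import numpy as np
from skimage.transform import hough_line, hough_line_peaks

# rasterize the (x, y) points onto a 200x200 binary image
img = np.zeros((200, 200), dtype=bool)
cols = np.clip(((x - x.min()) / (np.ptp(x) + 1e-12) * 199).astype(int), 0, 199)
rows = np.clip(((y - y.min()) / (np.ptp(y) + 1e-12) * 199).astype(int), 0, 199)
img[rows, cols] = True

# accumulate votes in (angle, distance) space and keep the strongest peaks
h, angles, dists = hough_line(img)
for _, angle, dist in zip(*hough_line_peaks(h, angles, dists)):
    print("line: angle = %.1f deg, distance = %.1f px" % (np.degrees(angle), dist))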
After doing PCA on my data and plotting the kmeans clusters, my plot looks really weird. The centers of the clusters and scatter plot of the points do not make sense to me. Here is my code:
#clicks, conversion, bounce and search are lists of values.
clicks=[2,0,0,8,7,...]
conversion = [1,0,0,6,0...]
bounce = [2,4,5,0,1....]
X = np.array([clicks,conversion, bounce]).T
y = np.array(search)
num_clusters = 5
pca=PCA(n_components=2, whiten=True)
data2D = pca.fit_transform(X)
print data2D
>>> [[-0.07187948 -0.17784291]
[-0.07173769 -0.26868727]
[-0.07173789 -0.26867958]
...,
[-0.06942414 -0.25040886]
[-0.06950897 -0.19591147]
[-0.07172973 -0.2687937 ]]
km = KMeans(n_clusters=num_clusters, init='k-means++',n_init=10, verbose=1)
km.fit_transform(X)
labels=km.labels_
centers2D = pca.fit_transform(km.cluster_centers_)
colors=['#000000','#FFFFFF','#FF0000','#00FF00','#0000FF']
col_map=dict(zip(set(labels),colors))
label_color = [col_map[l] for l in labels]
plt.scatter( data2D[:,0], data2D[:,1], c=label_color)
plt.hold(True)
plt.scatter(centers2D[:,0], centers2D[:,1], marker='x', c='r')
plt.show()
The red crosses are the center of the clusters. Any help would be great.
Your ordering of PCA and KMeans is screwing things up...
Here is what you need to do:
1. Normalize your data.
2. Perform PCA on X to reduce the dimensions from 5 to 2 and produce Data2D.
3. Normalize again.
4. Cluster Data2D with KMeans.
5. Plot the centroids on top of Data2D.
Whereas here is what you have done above:
1. Perform PCA on X to reduce the dimensions from 5 to 2 to produce Data2D.
2. Cluster the original data, X, in 5 dimensions.
3. Perform a separate PCA on your cluster centroids, which produces a completely different 2D subspace for the centroids.
4. Plot the PCA-reduced Data2D with the PCA-reduced centroids on top, even though these are no longer coupled properly.
Normalization:
Take a look at the code below and you'll see that it puts the centroids right where they need to be. The normalization is key and is completely reversible. ALWAYS normalize your data when you cluster, as the distance metric needs to weight all of the dimensions equally. Clustering is one of the most important times to normalize your data, but in general... ALWAYS NORMALIZE :-)
A heuristic discussion that goes beyond your original question:
The entire point of dimensionality reduction is to make the KMeans clustering easier and to project out dimensions which don't add to the variance of the data. So you should pass the reduced data to your clustering algorithm. I'll add that there are very few 5D datasets which can be projected down to 2D without throwing out a lot of variance i.e. look at the PCA diagnostics to see whether 90% of the original variance has been preserved. If not, then you might not want to be so aggressive in your PCA.
New Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import seaborn as sns
%matplotlib inline
# read your data, replace 'stackoverflow.csv' with your file path
df = pd.read_csv('/Users/angus/Desktop/Downloads/stackoverflow.csv', usecols=[0, 2, 4], names=['freq', 'visit_length', 'conversion_cnt'], header=0).dropna()
df.describe()
#Normalize the data
df_norm = (df - df.mean()) / (df.max() - df.min())
num_clusters = 5
pca=PCA(n_components=2)
UnNormdata2D = pca.fit_transform(df_norm)
# Check the resulting variance
var = pca.explained_variance_ratio_
print "Variance after PCA: ", var
#Normalize again following PCA: data2D
data2D = (UnNormdata2D - UnNormdata2D.mean()) / (UnNormdata2D.max()-UnNormdata2D.min())
print "Data2D: "
print data2D
km = KMeans(n_clusters=num_clusters, init='k-means++',n_init=10, verbose=1)
km.fit_transform(data2D)
labels=km.labels_
centers2D = km.cluster_centers_
colors=['#000000','#FFFFFF','#FF0000','#00FF00','#0000FF']
col_map=dict(zip(set(labels),colors))
label_color = [col_map[l] for l in labels]
plt.scatter( data2D[:,0], data2D[:,1], c=label_color)
plt.hold(True)
plt.scatter(centers2D[:,0], centers2D[:,1],marker='x',s=150.0,color='purple')
plt.show()
Plot:
Output:
Variance after PCA: [ 0.65725709 0.29875307]
Data2D:
[[-0.00338421 -0.0009403 ]
[-0.00512081 -0.00095038]
[-0.00512081 -0.00095038]
...,
[-0.00477349 -0.00094836]
[-0.00373153 -0.00094232]
[-0.00512081 -0.00095038]]
Initialization complete
Iteration 0, inertia 51.225
Iteration 1, inertia 38.597
Iteration 2, inertia 36.837
...
...
Converged at iteration 31
Hope this helps!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
# read your data, replace 'stackoverflow.csv' with your file path
df = pd.read_csv('stackoverflow.csv', usecols=[0, 2, 4], names=['freq', 'visit_length', 'conversion_cnt'], header=0).dropna()
df.describe()
Out[3]:
freq visit_length conversion_cnt
count 289705.0000 289705.0000 289705.0000
mean 0.2624 20.7598 0.0748
std 0.4399 55.0571 0.2631
min 0.0000 1.0000 0.0000
25% 0.0000 6.0000 0.0000
50% 0.0000 10.0000 0.0000
75% 1.0000 21.0000 0.0000
max 1.0000 2500.0000 1.0000
# binarize freq and conversion_cnt
df.freq = np.where(df.freq > 1.0, 1, 0)
df.conversion_cnt = np.where(df.conversion_cnt > 0.0, 1, 0)
feature_names = df.columns
X_raw = df.values
transformer = PCA(n_components=2)
X_2d = transformer.fit_transform(X_raw)
# over 99.9% variance captured by 2d data
transformer.explained_variance_ratio_
Out[4]: array([ 9.9991e-01, 6.6411e-05])
# do clustering
estimator = KMeans(n_clusters=5, init='k-means++', n_init=10, verbose=1)
estimator.fit(X_2d)
labels = estimator.labels_
colors = ['#000000','#FFFFFF','#FF0000','#00FF00','#0000FF']
col_map=dict(zip(set(labels),colors))
label_color = [col_map[l] for l in labels]
fig, ax = plt.subplots()
ax.scatter(X_2d[:,0], X_2d[:,1], c=label_color)
ax.scatter(estimator.cluster_centers_[:,0], estimator.cluster_centers_[:,1], marker='x', s=50, c='r')
KMeans tries to minimize within-group Euclidean distance, and this may or may not be appropriate for your data. Just based on the graph, I would consider a Gaussian Mixture Model to do the unsupervised clustering.
Also, if you have prior knowledge of which observations might be classified into which category/label, you can do semi-supervised learning.
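For reference, a minimal sketch of the GMM suggestion, assuming the 2-D PCA output data2D from one of the snippets above (5 components just mirrors the original num_clusters):
from sklearn.mixture import GaussianMixture

# soft, ellipsoidal clusters instead of KMeans' hard, spherical ones
gmm = GaussianMixture(n_components=5, covariance_type='full', random_state=0)
gmm_labels = gmm.fit_predict(data2D)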