Clustering data with Python based on their correlation

I would like to cluster the following set of data in two clusters corresponding to each line ("\" and "/" ) of the "X". I was thinking that it could be done using the Pearson correlation coefficients as distance metric in Scikit-learn Agglomerative clustering as indicated here (How to use Pearson Correlation as distance metric in Scikit-learn Agglomerative clustering). But it doesn't seem to work.
Plot of the raw data
Data:
-6.5955882 11.344538
-6.1911765 12.027311
-5.4191176 10.346639
-4.7573529 7.5105042
-2.9191176 7.7205882
-1.5955882 6.6176471
-2.9558824 6.039916
-1.1544118 3.9915966
-0.088235294 4.7794118
-0.088235294 2.8361345
0.53676471 -1.2079832
2.7794118 0
3.4044118 -4.3592437
5.2794118 -3.9915966
6.75 -8.5609244
7.4485294 -6.8802521
5.1691176 -5.7247899
-7.1470588 -2.8361345
-6.7058824 -1.2605042
-4.4264706 -1.1554622
-3.5073529 0.78781513
-0.86029412 0.31512605
-1.0808824 2.1533613
-2.8823529 -0.42016807
1.0514706 2.2584034
1.9338235 4.4117647
4.6544118 5.5147059
3.7352941 7.0378151
6.0147059 8.2457983
7.0808824 7.7205882
The code I've tried:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.stats import pearsonr

nc = 2
data = np.loadtxt("cross-data_2.dat")
plt.scatter(data[:, 0], data[:, 1], s=100, cmap='viridis')

# pairwise (1 - Pearson correlation) matrix used as the affinity
def pearson_affinity(M):
    return 1 - np.array([[pearsonr(a, b)[0] for a in M] for b in M])

hc = AgglomerativeClustering(n_clusters=nc, affinity=pearson_affinity, linkage='average')
y_hc = hc.fit_predict(data)

plt.figure()
plt.scatter(data[y_hc == 0, 0], data[y_hc == 0, 1], s=100, c='red')
plt.scatter(data[y_hc == 1, 0], data[y_hc == 1, 1], s=100, c='black')
plt.show()
The results of the clustering:
Is there something wrong in the code or should I simply use another method?

I propose yet another method for this: Gaussian Mixture Models.
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

X = (your data)  # the (n_samples, 2) array of points

gmm = GaussianMixture(n_components=2,
                      init_params='random',
                      n_init=5,
                      random_state=123)
y_pred = gmm.fit_predict(X)
plt.scatter(*X.T, c=y_pred)

I can propose an alternative method to achieve this. Since you are trying to cluster points along the same angle, we can first transform the data to polar (r-theta) coordinates and then use simple KMeans clustering.
from sklearn.cluster import KMeans

x = data  # the (n_samples, 2) array loaded above

# polar coordinates: arctan (not arctan2) assigns the same angle to opposite ends of a line
r = np.sqrt(x[:, 0]**2 + x[:, 1]**2)
theta = np.arctan(x[:, 1] / x[:, 0])

# fold both arms of each line onto a single ray before clustering
xr = np.vstack((r * np.sin(theta), r * np.cos(theta))).T

km = KMeans(2)
xx = km.fit_predict(xr)
plt.scatter(x[:, 0], x[:, 1], c=xx)


How could I use a dynamic epsilon in DBSCAN?

Today I'm working on a dataset from Kaggle: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data. I would like to segment my dataset by beds, baths, and neighborhood, and use DBSCAN to get a clustering by price in each segment. The problem is that, because each segment is different, I don't want to use the same epsilon for my whole dataset; rather, I want the best epsilon for each segment. Do you know an efficient way to do this?
import numpy as np
from sklearn.cluster import DBSCAN
import sklearn.utils
from sklearn.preprocessing import StandardScaler

sklearn.utils.check_random_state(1000)

Clus_dataSet = pdf[['beds', 'baths', 'neighborhood', 'price']]
Clus_dataSet = np.nan_to_num(Clus_dataSet)
Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)

# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=6).fit(Clus_dataSet)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
pdf["Clus_Db"] = labels

realClusterNum = len(set(labels)) - (1 if -1 in labels else 0)
clusterNum = len(set(labels))
Thank you.
A heuristic for setting the Epsilon and MinPts parameters has been proposed in the original DBSCAN paper.
Once the MinPts value is set (e.g. 2 * number of features), the partitioning result strongly depends on Epsilon. The heuristic suggests inferring Epsilon through a visual analysis of the k-dist plot.
A toy example of the procedure with two Gaussian distributions is reported below.
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt
from sklearn.datasets import make_biclusters
data,lab,_ = make_biclusters((200,2), 2, noise=0.1, minval=0, maxval=1)
minpts = 4
nbrs = NearestNeighbors(n_neighbors=minpts, algorithm='ball_tree').fit(data)
distances, indices = nbrs.kneighbors(data)
k_dist = [x[-1] for x in distances]
f,ax = plt.subplots(1,2,figsize = (10,5))
ax[0].set_title('k-dist plot for k = minpts = 4')
ax[0].plot(sorted(k_dist))
ax[0].set_xlabel('object index after sorting by k-distance')
ax[0].set_ylabel('k-distance')
ax[1].set_title('original data')
ax[1].scatter(data[:,0],data[:,1],c = lab[0])
In the resulting k-dist plot, the "elbow" theoretically divides noise objects from cluster objects and indeed gives an indication of a plausible range of values for Epsilon (tailored to the dataset, in combination with the selected value of MinPts). In this toy example, I would say between 0.05 and 0.075.
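Applied back to the per-segment question, here is a minimal sketch of how this heuristic could be automated segment by segment, assuming the pdf DataFrame and columns from the question; the largest jump in the sorted k-distances is used as a crude stand-in for the visual elbow, and degenerate segments (e.g. all prices identical) are not handled:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def estimate_eps(X, minpts):
    # sorted k-distances; the largest jump is a rough automatic proxy for the visual elbow
    nbrs = NearestNeighbors(n_neighbors=minpts).fit(X)
    k_dist = np.sort(nbrs.kneighbors(X)[0][:, -1])
    return k_dist[np.argmax(np.diff(k_dist))]

minpts = 2  # 2 * number of features (price only)
pdf['Clus_Db'] = -1  # default label: noise
for _, seg in pdf.groupby(['beds', 'baths', 'neighborhood']):
    if len(seg) <= minpts:  # segments this small cannot form a cluster anyway
        continue
    X = StandardScaler().fit_transform(seg[['price']])
    eps = estimate_eps(X, minpts)
    # note: cluster ids restart at 0 in every segment, so they are only comparable within a segment
    pdf.loc[seg.index, 'Clus_Db'] = DBSCAN(eps=eps, min_samples=minpts).fit(X).labels_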

How to give sns.clustermap a precomputed distance matrix?

Usually when I do dendrograms and heatmaps, I use a distance matrix and do a bunch of SciPy stuff. I want to try out Seaborn, but Seaborn wants my data in rectangular form (rows=samples, cols=attributes), not as a distance matrix.
I essentially want to use Seaborn as the backend to compute my dendrogram and tack it on to my heatmap. Is this possible? If not, could this be a feature in the future?
Maybe there are parameters I can adjust so it can take a distance matrix instead of a rectangular matrix?
Here's the usage:
seaborn.clustermap(data, pivot_kws=None, method='average', metric='euclidean',
                   z_score=None, standard_scale=None, figsize=None, cbar_kws=None,
                   row_cluster=True, col_cluster=True, row_linkage=None, col_linkage=None,
                   row_colors=None, col_colors=None, mask=None, **kwargs)
My code below:
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
DF = pd.DataFrame(X, index=["iris_%d" % (i) for i in range(X.shape[0])], columns=iris.feature_names)
I don't think my method below is correct, because I'm giving it a precomputed distance matrix and NOT a rectangular data matrix as it requests. There are no examples of how to use a correlation/distance matrix with clustermap, but there is one at https://stanford.edu/~mwaskom/software/seaborn/examples/network_correlations.html; however, that uses the plain sns.heatmap function, so the ordering is not clustered.
DF_corr = DF.T.corr()
DF_dism = 1 - DF_corr
sns.clustermap(DF_dism)
You can pass the precomputed distance matrix as linkage to clustermap():
import pandas as pd, seaborn as sns
import scipy.spatial as sp, scipy.cluster.hierarchy as hc
from sklearn.datasets import load_iris
sns.set(font="monospace")
iris = load_iris()
X, y = iris.data, iris.target
DF = pd.DataFrame(X, index = ["iris_%d" % (i) for i in range(X.shape[0])], columns = iris.feature_names)
DF_corr = DF.T.corr()
DF_dism = 1 - DF_corr # distance matrix
linkage = hc.linkage(sp.distance.squareform(DF_dism), method='average')
sns.clustermap(DF_dism, row_linkage=linkage, col_linkage=linkage)
For clustermap(distance_matrix) (i.e., without linkage passed), the linkage is calculated internally based on pairwise distances of the rows and columns in the distance matrix (see note below for full details) instead of using the elements of the distance matrix directly (the correct solution). As a result, the output is somewhat different from the one in the question:
Note: if no row_linkage is passed to clustermap(), the row linkage is determined internally by considering each row a "point" (observation) and calculating the pairwise distances between the points. So the row dendrogram reflects row similarity. The same applies to col_linkage, where each column is considered a point. This explanation should probably be added to the docs. Here is the docs' first example, modified to make the internal linkage calculation explicit:
import seaborn as sns; sns.set()
import scipy.spatial as sp, scipy.cluster.hierarchy as hc
flights = sns.load_dataset("flights")
flights = flights.pivot("month", "year", "passengers")
row_linkage, col_linkage = (hc.linkage(sp.distance.pdist(x), method='average')
                            for x in (flights.values, flights.values.T))
g = sns.clustermap(flights, row_linkage=row_linkage, col_linkage=col_linkage)
# note: this produces the same plot as "sns.clustermap(flights)", where
# clustermap() calculates the row and column linkages internally

k-means and MiniBatchKMeans in Python run out of memory

I am using sklearn.cluster KMeans with Python. I am trying to cluster data and plot a graph for my vectors (size = 109977).
Here is my code:
import gensim
import matplotlib.pyplot as plt
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

model = gensim.models.Word2Vec.load("../wordvectors_300dimeansion.model")
vectors = model.syn0  # size 109977
n_clusters_kmeans = 20  # more for visualization, 100 better for clustering

#kmeans = KMeans(init='k-means++', n_clusters=n_clusters_kmeans, n_init=10)
min_kmeans = MiniBatchKMeans(init='k-means++', n_clusters=n_clusters_kmeans, n_init=10)
min_kmeans.fit(vectors)

X_reduced = TruncatedSVD(n_components=50, random_state=0).fit_transform(vectors)
X_embedded = TSNE(n_components=2, perplexity=40, verbose=2).fit_transform(X_reduced)
target = min_kmeans.labels_

fig = plt.figure(figsize=(10, 10))
ax = plt.axes(frameon=False)
plt.setp(ax, xticks=(), yticks=())
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=0.9, wspace=0.0, hspace=0.0)
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=target, marker="x")
plt.show()
Computing pairwise distances for k-means would mean creating a matrix of pairwise distances of size 109977 x 109977, and I think this is why I got an "out of memory" error.
Please suggest a scalable k-means clustering algorithm for large datasets.
Is there any tool available for this purpose that I can import into my program directly, as I did for sklearn.cluster KMeans and MiniBatchKMeans?
Thanks.

PCA output looks weird for a kmeans scatter plot

After doing PCA on my data and plotting the kmeans clusters, my plot looks really weird. The centers of the clusters and scatter plot of the points do not make sense to me. Here is my code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# clicks, conversion, bounce and search are lists of values
clicks = [2, 0, 0, 8, 7, ...]
conversion = [1, 0, 0, 6, 0, ...]
bounce = [2, 4, 5, 0, 1, ...]

X = np.array([clicks, conversion, bounce]).T
y = np.array(search)
num_clusters = 5

pca = PCA(n_components=2, whiten=True)
data2D = pca.fit_transform(X)
print data2D
>>> [[-0.07187948 -0.17784291]
[-0.07173769 -0.26868727]
[-0.07173789 -0.26867958]
...,
[-0.06942414 -0.25040886]
[-0.06950897 -0.19591147]
[-0.07172973 -0.2687937 ]]
km = KMeans(n_clusters=num_clusters, init='k-means++',n_init=10, verbose=1)
km.fit_transform(X)
labels=km.labels_
centers2D = pca.fit_transform(km.cluster_centers_)
colors=['#000000','#FFFFFF','#FF0000','#00FF00','#0000FF']
col_map=dict(zip(set(labels),colors))
label_color = [col_map[l] for l in labels]
plt.scatter( data2D[:,0], data2D[:,1], c=label_color)
plt.hold(True)
plt.scatter(centers2D[:,0], centers2D[:,1], marker='x', c='r')
plt.show()
The red crosses are the center of the clusters. Any help would be great.
Your ordering of PCA and KMeans is screwing things up...
Here is what you need to do:
Normalize your data.
Perform PCA on X to reduce the dimensions from 5 to 2 and produce Data2D
Normalize again
Cluster Data2D with KMeans
Plot the Centroids on top of Data2D.
Whereas, here is what you have done above:
Perform PCA on X to reduce the dimensions from 5 to 2 to produce Data2D
Cluster the original data, X, in 5 dimensions.
Perform a separate PCA on your cluster centroids, which produces a completely different 2D subspace for the centroids.
Plot the PCA reduced Data2D with the PCA reduced centroids on top even though these no longer are coupled properly.
Normalization:
Take a look at the code below and you'll see that it puts the centroids right where they need to be. The normalization is key and is completely reversible. ALWAYS normalize your data when you cluster as the distance metrics need to move through all of the spaces equally. Clustering is one of the most important times to normalize your data, but in general... ALWAYS NORMALIZE :-)
A heuristic discussion that goes beyond your original question:
The entire point of dimensionality reduction is to make the KMeans clustering easier and to project out dimensions which don't add to the variance of the data. So you should pass the reduced data to your clustering algorithm. I'll add that there are very few 5D datasets which can be projected down to 2D without throwing out a lot of variance i.e. look at the PCA diagnostics to see whether 90% of the original variance has been preserved. If not, then you might not want to be so aggressive in your PCA.
New Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import seaborn as sns
%matplotlib inline
# read your data, replace 'stackoverflow.csv' with your file path
df = pd.read_csv('/Users/angus/Desktop/Downloads/stackoverflow.csv', usecols=[0, 2, 4],
                 names=['freq', 'visit_length', 'conversion_cnt'], header=0).dropna()
df.describe()
#Normalize the data
df_norm = (df - df.mean()) / (df.max() - df.min())
num_clusters = 5
pca=PCA(n_components=2)
UnNormdata2D = pca.fit_transform(df_norm)
# Check the resulting variance
var = pca.explained_variance_ratio_
print "Variance after PCA: ", var
#Normalize again following PCA: data2D
data2D = (UnNormdata2D - UnNormdata2D.mean()) / (UnNormdata2D.max()-UnNormdata2D.min())
print "Data2D: "
print data2D
km = KMeans(n_clusters=num_clusters, init='k-means++',n_init=10, verbose=1)
km.fit_transform(data2D)
labels=km.labels_
centers2D = km.cluster_centers_
colors=['#000000','#FFFFFF','#FF0000','#00FF00','#0000FF']
col_map=dict(zip(set(labels),colors))
label_color = [col_map[l] for l in labels]
plt.scatter( data2D[:,0], data2D[:,1], c=label_color)
plt.hold(True)
plt.scatter(centers2D[:,0], centers2D[:,1],marker='x',s=150.0,color='purple')
plt.show()
Plot:
Output:
Variance after PCA:  [ 0.65725709  0.29875307]
Data2D:
[[-0.00338421 -0.0009403 ]
[-0.00512081 -0.00095038]
[-0.00512081 -0.00095038]
...,
[-0.00477349 -0.00094836]
[-0.00373153 -0.00094232]
[-0.00512081 -0.00095038]]
Initialization complete
Iteration 0, inertia 51.225
Iteration 1, inertia 38.597
Iteration 2, inertia 36.837
...
...
Converged at iteration 31
Hope this helps!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
# read your data, replace 'stackoverflow.csv' with your file path
df = pd.read_csv('stackoverflow.csv', usecols=[0, 2, 4], names=['freq', 'visit_length', 'conversion_cnt'], header=0).dropna()
df.describe()
Out[3]:
               freq  visit_length  conversion_cnt
count   289705.0000   289705.0000     289705.0000
mean         0.2624       20.7598          0.0748
std          0.4399       55.0571          0.2631
min          0.0000        1.0000          0.0000
25%          0.0000        6.0000          0.0000
50%          0.0000       10.0000          0.0000
75%          1.0000       21.0000          0.0000
max          1.0000     2500.0000          1.0000
# binarize freq and conversion_cnt
df.freq = np.where(df.freq > 1.0, 1, 0)
df.conversion_cnt = np.where(df.conversion_cnt > 0.0, 1, 0)
feature_names = df.columns
X_raw = df.values
transformer = PCA(n_components=2)
X_2d = transformer.fit_transform(X_raw)
# over 99.9% variance captured by 2d data
transformer.explained_variance_ratio_
Out[4]: array([ 9.9991e-01, 6.6411e-05])
# do clustering
estimator = KMeans(n_clusters=5, init='k-means++', n_init=10, verbose=1)
estimator.fit(X_2d)
labels = estimator.labels_
colors = ['#000000','#FFFFFF','#FF0000','#00FF00','#0000FF']
col_map=dict(zip(set(labels),colors))
label_color = [col_map[l] for l in labels]
fig, ax = plt.subplots()
ax.scatter(X_2d[:,0], X_2d[:,1], c=label_color)
ax.scatter(estimator.cluster_centers_[:,0], estimator.cluster_centers_[:,1], marker='x', s=50, c='r')
KMeans tries to minimize within-group Euclidean distance, and this may or may not be appropriate for your data. Just based on the graph, I would consider a Gaussian Mixture Model to do the unsupervised clustering.
Also, if you have prior knowledge of which observations might belong to which category/label, you can do semi-supervised learning.
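As a hedged sketch of that suggestion, a GaussianMixture can be dropped in for KMeans on the X_2d projection from the code above; the number of components here is just an assumption to tune:
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

# fit a mixture of Gaussians on the 2D PCA projection computed above
gmm = GaussianMixture(n_components=5, covariance_type='full', n_init=5, random_state=0)
gmm_labels = gmm.fit_predict(X_2d)

fig, ax = plt.subplots()
ax.scatter(X_2d[:, 0], X_2d[:, 1], c=gmm_labels)
ax.scatter(gmm.means_[:, 0], gmm.means_[:, 1], marker='x', s=50, c='r')  # component means
plt.show()
Unlike KMeans, the mixture model allows elongated (full-covariance) clusters, which is often a better fit for skewed usage data like this.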

Sklearn.KMeans() : Get class centroid labels and reference to a dataset

Sci-Kit learn Kmeans and PCA dimensionality reduction
I have a dataset, 2M rows by 7 columns, with different measurements of home power consumption with a date for each measurement.
date,
Global_active_power,
Global_reactive_power,
Voltage,
Global_intensity,
Sub_metering_1,
Sub_metering_2,
Sub_metering_3
I put my dataset into a pandas dataframe, selecting all columns but the date column, then perform cross validation split.
import pandas as pd
from sklearn.cross_validation import train_test_split
data = pd.read_csv('household_power_consumption.txt', delimiter=';')
power_consumption = data.iloc[0:, 2:9].dropna()
pc_toarray = power_consumption.values
hpc_fit, hpc_fit1 = train_test_split(pc_toarray, train_size=.01)
power_consumption.head()
I use K-means classification followed by PCA dimensionality reduction to display.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
hpc = PCA(n_components=2).fit_transform(hpc_fit)
k_means = KMeans()
k_means.fit(hpc)
x_min, x_max = hpc[:, 0].min() - 5, hpc[:, 0].max() - 1
y_min, y_max = hpc[:, 1].min(), hpc[:, 1].max() + 5
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))
Z = k_means.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')
plt.plot(hpc[:, 0], hpc[:, 1], 'k.', markersize=4)
centroids = k_means.cluster_centers_
inert = k_means.inertia_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=8)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()
Now I would like to find out which rows fell under a given class, and then which dates fell under a given class.
Is there any way to relate the points on the graph to an index in my dataset, after PCA?
Some method I don't know of?
Or is my approach fundamentally flawed?
Any recommendations?
I am fairly new to this field and am trying to read through lots of code; this is a compilation of several examples I've seen documented.
My goal is to classify the data and then get the dates that fall under a class.
Thank You
KMeans().predict(X), from the docs:
Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
Parameters: X : {array-like, sparse matrix}, shape = [n_samples, n_features] (new data to predict)
Returns: labels : array, shape [n_samples,] (index of the cluster each sample belongs to)
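As a minimal illustration, using the k_means model and hpc array already fitted in the question's code, predict() returns one cluster index per row, which you can then use to select the matching rows:
# one cluster index per row of hpc, in the same order as the rows
row_clusters = k_means.predict(hpc)
# rows of hpc assigned to cluster 0 (cluster id 0 chosen only as an example)
hpc_cluster0 = hpc[row_clusters == 0]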
The problem I see with the code you submitted is the use of train_test_split(), which returns two arrays of random rows from your dataset, effectively ruining the dataset order and making it difficult to correlate the labels returned from the KMeans classification with the sequential dates in your dataset.
Here's an example:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
#read data into pandas dataframe
df = pd.read_csv('household_power_consumption.txt', delimiter=';')
# merge the Date and Time columns and convert to datetime objects
df['Datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
df.set_index(pd.DatetimeIndex(df['Datetime']), inplace=True)
df.drop(['Date', 'Time'], axis=1, inplace=True)

# put the last column (Datetime) first
cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]
df = df.dropna()

# convert the dataframe to an array, leaving out the Datetime column
sliced = df.iloc[0:, 1:8].dropna()
hpc = sliced.values
k_means = KMeans()
k_means.fit(hpc)
# array of indexes corresponding to classes around centroids, in the order of your dataset
classified_data = k_means.labels_
#copy dataframe (may be memory intensive but just for illustration)
df_processed = df.copy()
df_processed['Cluster Class'] = pd.Series(classified_data, index=df_processed.index)
Now you can see your result matched with your dataset.
Now that it's classified, it's up to you to derive meaning.
This is just a good overall example of how it can be used, from start to finish.
To display your result, look at PCA or make other graphs that depend on the class.
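And for the original goal of pulling out the dates that fall under a given class, here is a minimal follow-up sketch on the df_processed frame built above (cluster id 0 is just an example):
cluster_id = 0  # any label produced by KMeans works here
rows_in_cluster = df_processed[df_processed['Cluster Class'] == cluster_id]
dates_in_cluster = rows_in_cluster.index  # the DatetimeIndex holds the measurement dates
print(dates_in_cluster[:10])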
