sklearn's KMeans: Cluster centers and cluster means differ. Numerical imprecision?

I've noticed that when using sklearn.cluster.KMeans, the cluster centers obtained from the .cluster_centers_ attribute and the means computed manually for each cluster don't give exactly the same answer.
For small sample sizes the difference is very small and probably within float imprecision. But for larger samples:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
np.random.seed(0)
x = np.random.normal(size=5000)
x_z = (x - x.mean() / x.std()).reshape(5000,1)
cluster=KMeans(n_clusters=2).fit(x_z)
df = pd.DataFrame(x_z)
df['label'] = cluster.labels_
difference = np.abs(df.groupby('label').mean() - cluster.cluster_centers_)
print(difference)
[[ 0.00217333]
[ 0.00223798]]
I see similar discrepancies when I repeat this for different sample sizes. This seems too large to be floating-point imprecision. Are cluster centers not means, or what's going on here?

I think it is related to the tolerance of KMeans: the algorithm stops once the centers move by less than tol between iterations, so the reported centers may not have fully converged to the means of the finally assigned labels. The default value is 1e-4, so setting a lower value, e.g. tol=1e-8, gives:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
np.random.seed(0)
x = np.random.normal(size=5000)
x_z = (x - x.mean() / x.std()).reshape(5000,1)
cluster=KMeans(n_clusters=2, tol=1e-8).fit(x_z)
df = pd.DataFrame(x_z)
df['label'] = cluster.labels_
difference = np.abs(df.groupby('label').mean() - cluster.cluster_centers_)
print(difference)
                    0
label
0      9.99200722e-16
1      1.11022302e-16
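As a sanity check you can also recompute the centers from the final labels yourself; with the tighter tolerance they should match cluster_centers_ to within float precision. A minimal sketch, reusing x_z and cluster from the snippet above (manual_centers is just a throwaway name):
# Recompute each cluster's mean from the final labels; with tol=1e-8 this
# should agree with cluster_centers_ up to float precision.
manual_centers = np.array([x_z[cluster.labels_ == k].mean(axis=0) for k in range(2)])
print(np.abs(manual_centers - cluster.cluster_centers_))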
Hope it helps.

Related

Kernel Density Estimation using scipy's gaussian_kde and sklearn's KernelDensity leads to different results

I created some data from two superposed normal distributions and then applied sklearn.neighbors.KernelDensity and scipy.stats.gaussian_kde to estimate the density function. However, using the same bandwidth (1.0) and the same kernel, the two methods produce different outcomes. Can someone explain the reason for this? Thanks for the help.
Below you can find the code to reproduce the issue:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
import seaborn as sns
from sklearn.neighbors import KernelDensity
n = 10000
dist_frac = 0.1
x1 = np.random.normal(-5,2,int(n*dist_frac))
x2 = np.random.normal(5,3,int(n*(1-dist_frac)))
x = np.concatenate((x1,x2))
np.random.shuffle(x)
eval_points = np.linspace(np.min(x), np.max(x))
kde_sk = KernelDensity(bandwidth=1.0, kernel='gaussian')
kde_sk.fit(x.reshape([-1,1]))
y_sk = np.exp(kde_sk.score_samples(eval_points.reshape(-1,1)))
kde_sp = gaussian_kde(x, bw_method=1.0)
y_sp = kde_sp.pdf(eval_points)
sns.kdeplot(x)
plt.plot(eval_points, y_sk)
plt.plot(eval_points, y_sp)
plt.legend(['seaborn','scikit','scipy'])
If I change the scipy bandwidth to 0.25, the results of both methods look approximately the same.
What is meant by bandwidth in scipy.stats.gaussian_kde and in sklearn.neighbors.KernelDensity is not the same. scipy.stats.gaussian_kde uses a bandwidth factor (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html). For a 1-D kernel density estimation the following relation applies:
bandwidth of sklearn.neighbors.KernelDensity = bandwidth factor of scipy.stats.gaussian_kde * standard deviation of the sample
For your estimation this probably means that your sample's standard deviation is about 4 (i.e. 1.0 / 0.25).
I would like to refer to Getting bandwidth used by SciPy's gaussian_kde function for more information.
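If that relation is right, you can make the two estimators agree by dividing the desired bandwidth by the sample standard deviation when calling gaussian_kde. A minimal sketch, assuming a 1-D sample and a Gaussian kernel (the data below is just a stand-in similar to the question's):
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-5, 2, 1000), rng.normal(5, 3, 9000)])
eval_points = np.linspace(x.min(), x.max(), 200)

# sklearn: bandwidth is the kernel standard deviation directly
kde_sk = KernelDensity(bandwidth=1.0, kernel='gaussian').fit(x.reshape(-1, 1))
y_sk = np.exp(kde_sk.score_samples(eval_points.reshape(-1, 1)))

# scipy: bw_method is a factor that gets multiplied by the sample std,
# so divide the desired bandwidth by the std to get the same kernel width
kde_sp = gaussian_kde(x, bw_method=1.0 / x.std(ddof=1))
y_sp = kde_sp.pdf(eval_points)

print(np.max(np.abs(y_sk - y_sp)))  # should be close to 0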
To be honest, I don't know why, but using the scipy hyperparameter bw_method='scott' makes it work exactly the same as seaborn.
So it seems to be all about the hyperparameters. We could find out why by understanding them in depth, but in the meantime just use 'scott' or 'silverman' instead of a random scalar.
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
import seaborn as sns
from sklearn.neighbors import KernelDensity
n = 10000
dist_frac = 0.1
x1 = np.random.normal(-5,2,int(n*dist_frac))
x2 = np.random.normal(5,3,int(n*(1-dist_frac)))
x = np.concatenate((x1,x2))
np.random.shuffle(x)
eval_points = np.linspace(np.min(x), np.max(x))
kde_sk = KernelDensity(bandwidth=1, kernel='gaussian')
kde_sk.fit(x.reshape([-1,1]))
y_sk = np.exp(kde_sk.score_samples(eval_points.reshape(-1,1)))
kde_sp = gaussian_kde(x, bw_method='scott') ### I MEAN HERE! ###
y_sp = kde_sp.pdf(eval_points)
sns.kdeplot(x)
plt.plot(eval_points, y_sk)
plt.plot(eval_points, y_sp)
plt.legend(['seaborn','scikit','scipy'])
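If you want to see what bandwidth 'scott' actually picked, gaussian_kde exposes the factor it used; multiplying it by the sample standard deviation gives roughly the bandwidth on sklearn's scale. A small check, reusing kde_sp and x from the snippet above:
print(kde_sp.factor)                   # the Scott bandwidth factor chosen by scipy
print(kde_sp.factor * x.std(ddof=1))   # roughly the equivalent sklearn-style bandwidth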
Increase the size of the 'random normal' sample; your data points are too few.
Try with n=500000 and check the results.

silhouette_score returns an inconsistent number of samples

I am using scikit's silhouette_score with hierarchical clustering. I am not from a data science or Python background; however, I do know some other languages and I understand how hierarchical clustering works. I was told to use scikit's silhouette_score to calculate the silhouette score. This code returns an error of
ValueError: Found input variables with inconsistent numbers of samples: [149, 150]
The data is a CSV containing 151 rows, with the first row being the header, so in total there are 150 rows of data.
Here is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.metrics import silhouette_score
iris = pd.read_csv("Iris.csv")
#iris hierarichal
iris_df = iris.iloc[:, 1:5]
plt.figure(figsize=(10, 7))
plt.title("Iris Dendograms Average Method")
link = linkage(iris_df, method='average')
dend = dendrogram(link)
plt.show()
clusters = fcluster(link, 3, criterion='maxclust')
print(silhouette_score(link, clusters))
You've got a problem here:
print(silhouette_score(link, clusters))
Change it and you're good to go:
print(silhouette_score(X, clusters))
Please see docs for silhouette_score:
X: array [n_samples_a, n_samples_a] if metric == “precomputed”, or, [n_samples_a, n_features] otherwise.
Array of pairwise distances between samples, or a feature array.
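In the code from the question the feature matrix is iris_df rather than X, so the fixed call would look like this (a sketch using the variables defined in the question's code):
clusters = fcluster(link, 3, criterion='maxclust')
# pass the feature matrix, not the linkage matrix, as the first argument
print(silhouette_score(iris_df, clusters))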

How could I use a dynamic epsilon in DBSCAN?

Today I'm working on a dataset from Kaggle (https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data). I would like to segment my dataset by beds, baths, and neighborhood, and use DBSCAN to cluster by price within each segment. The problem is that because each segment is different, I don't want to use the same epsilon for the whole dataset; I want the best epsilon for each segment. Do you know an efficient way to do this?
from sklearn.cluster import DBSCAN
import sklearn.utils
from sklearn.preprocessing import StandardScaler
sklearn.utils.check_random_state(1000)
Clus_dataSet = pdf[['beds','baths','neighborhood','price']]
Clus_dataSet = np.nan_to_num(Clus_dataSet)
Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=6).fit(Clus_dataSet)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
pdf["Clus_Db"]=labels
realClusterNum=len(set(labels)) - (1 if -1 in labels else 0)
clusterNum = len(set(labels))
Thank you.
A heuristic for setting the Epsilon and MinPts parameters was proposed in the original DBSCAN paper.
Once the MinPts value is set (e.g. 2 * number of features), the partitioning result strongly depends on Epsilon. The heuristic suggests inferring Epsilon through a visual analysis of the k-dist plot.
A toy example of the procedure with two Gaussian distributions is shown below.
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt
from sklearn.datasets import make_biclusters
data,lab,_ = make_biclusters((200,2), 2, noise=0.1, minval=0, maxval=1)
minpts = 4
nbrs = NearestNeighbors(n_neighbors=minpts, algorithm='ball_tree').fit(data)
distances, indices = nbrs.kneighbors(data)
k_dist = [x[-1] for x in distances]
f,ax = plt.subplots(1,2,figsize = (10,5))
ax[0].set_title('k-dist plot for k = minpts = 4')
ax[0].plot(sorted(k_dist))
ax[0].set_xlabel('object index after sorting by k-distance')
ax[0].set_ylabel('k-distance')
ax[1].set_title('original data')
ax[1].scatter(data[:,0],data[:,1],c = lab[0])
In the resulting k-dist plot, the "elbow" theoretically divides noise objects from cluster objects and gives an indication of a plausible range of values for Epsilon (tailored to the dataset in combination with the selected value of MinPts). In this toy example, I would say somewhere between 0.05 and 0.075.
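To get a different epsilon per segment, the same k-dist idea can be applied inside a groupby loop. A rough sketch, assuming pdf and the column names from the question, clustering on raw price within each segment; the 0.9 quantile of the k-distances is an arbitrary stand-in for reading the elbow by eye:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

minpts = 6
segment_labels = {}
for key, seg in pdf.groupby(['beds', 'baths', 'neighborhood']):
    prices = seg[['price']].to_numpy()
    if len(prices) <= minpts:
        continue  # too few listings in this segment to cluster
    # k-distance of every point to its minpts-th nearest neighbour
    dists, _ = NearestNeighbors(n_neighbors=minpts).fit(prices).kneighbors(prices)
    # crude elbow substitute: take an upper quantile of the k-distances
    eps = np.quantile(dists[:, -1], 0.9)
    if eps <= 0:
        continue  # all prices identical in this segment; nothing to cluster
    segment_labels[key] = DBSCAN(eps=eps, min_samples=minpts).fit_predict(prices)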

Custom binning in sklearn.preprocessing?

I have a list of continuous variables called size_array. I've been scaling them to [0, 1] like this:
max_abs_scaler = preprocessing.MinMaxScaler()
scaled = max_abs_scaler.fit_transform(size_array)
Is there a way to scale them to the range [-1, 1] so that the median (or a given percentile) is 0? My data is right-skewed, so the values above the median are spread out a lot and the values below the median are not. I tried scaling them with this method:
def using_median(x):
    if x >= median:
        return (x - median) / (max - median)
    else:
        return (median - x) / (median - min)
But that didn't work. Is there any other way to do this with sklearn.preprocessing?
I would recommend using the PowerTransformer(). It can work very well for skewed distributions.
Check out this example:
from sklearn import preprocessing
import matplotlib.pyplot as plt
import numpy as np
pt = preprocessing.PowerTransformer()
X_lognormal = np.random.RandomState(616)\
.lognormal(size=(300, 2))
_,ax = plt.subplots(1,2,sharey=True)
ax[0].hist(X_lognormal)
ax[1].hist(pt.fit_transform(X_lognormal))
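If you specifically want the median mapped to 0 and everything kept inside [-1, 1], one option (my suggestion, not the only way) is to chain RobustScaler, which centers on the median, with MaxAbsScaler. A sketch with a right-skewed lognormal sample standing in for size_array:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler, MaxAbsScaler

# stand-in for size_array: a right-skewed sample
size_array = np.random.RandomState(0).lognormal(size=(300, 1))

pipe = make_pipeline(
    RobustScaler(),   # median -> 0, scaled by the IQR
    MaxAbsScaler(),   # largest absolute value -> 1
)
scaled = pipe.fit_transform(size_array)
print(np.median(scaled), scaled.min(), scaled.max())  # median 0, values within [-1, 1]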

What's the difference between pandas ACF and statsmodels ACF?

I'm calculating the autocorrelation function for a stock's returns. To do so I tested two functions: the autocorr function built into Pandas and the acf function supplied by statsmodels.tsa. This is done in the following MWE:
import pandas as pd
from pandas_datareader import data
import matplotlib.pyplot as plt
import datetime
from dateutil.relativedelta import relativedelta
from statsmodels.tsa.stattools import acf, pacf
ticker = 'AAPL'
time_ago = datetime.datetime.today().date() - relativedelta(months = 6)
ticker_data = data.get_data_yahoo(ticker, time_ago)['Adj Close'].pct_change().dropna()
ticker_data_len = len(ticker_data)
ticker_data_acf_1 = acf(ticker_data)[1:32]
ticker_data_acf_2 = [ticker_data.autocorr(i) for i in range(1,32)]
test_df = pd.DataFrame([ticker_data_acf_1, ticker_data_acf_2]).T
test_df.columns = ['Pandas Autocorr', 'Statsmodels Autocorr']
test_df.index += 1
test_df.plot(kind='bar')
What I noticed was that the values they produced weren't identical. What accounts for this difference, and which values should be used?
The difference between the Pandas and Statsmodels versions lies in the mean subtraction and the normalization / variance division:
autocorr does nothing more than pass subseries of the original series to np.corrcoef. Inside this method, the sample mean and sample variance of these subseries are used to determine the correlation coefficient.
acf, by contrast, uses the overall series' sample mean and sample variance to determine the correlation coefficient.
The differences may get smaller for longer time series but are quite big for short ones.
Compared to Matlab, the Pandas autocorr function probably corresponds to doing Matlab's xcorr (cross-correlation) with the (lagged) series itself, instead of Matlab's autocorr, which calculates the sample autocorrelation (guessing from the docs; I cannot validate this because I have no access to Matlab).
See this MWE for clarification:
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf
import matplotlib.pyplot as plt
plt.style.use("seaborn-colorblind")
def autocorr_by_hand(x, lag):
    # Slice the relevant subseries based on the lag
    y1 = x[:(len(x)-lag)]
    y2 = x[lag:]
    # Subtract the subseries means
    sum_product = np.sum((y1-np.mean(y1))*(y2-np.mean(y2)))
    # Normalize with the subseries stds
    return sum_product / ((len(x) - lag) * np.std(y1) * np.std(y2))

def acf_by_hand(x, lag):
    # Slice the relevant subseries based on the lag
    y1 = x[:(len(x)-lag)]
    y2 = x[lag:]
    # Subtract the mean of the whole series x to calculate Cov
    sum_product = np.sum((y1-np.mean(x))*(y2-np.mean(x)))
    # Normalize with var of whole series
    return sum_product / ((len(x) - lag) * np.var(x))
x = np.linspace(0,100,101)
results = {}
nlags=10
results["acf_by_hand"] = [acf_by_hand(x, lag) for lag in range(nlags)]
results["autocorr_by_hand"] = [autocorr_by_hand(x, lag) for lag in range(nlags)]
results["autocorr"] = [pd.Series(x).autocorr(lag) for lag in range(nlags)]
results["acf"] = acf(x, unbiased=True, nlags=nlags-1)
pd.DataFrame(results).plot(kind="bar", figsize=(10,5), grid=True)
plt.xlabel("lag")
plt.ylim([-1.2, 1.2])
plt.ylabel("value")
plt.show()
Statsmodels uses np.correlate to optimize this, but this is basically how it works.
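For reference, this is roughly what that full-series computation looks like when written with np.correlate; a sketch of the default (non-adjusted) estimator, not statsmodels' actual implementation:
import numpy as np
from statsmodels.tsa.stattools import acf

def acf_via_correlate(x, nlags):
    # demean once with the full-series mean, then autocovariances via np.correlate
    xd = np.asarray(x, dtype=float) - np.mean(x)
    acov = np.correlate(xd, xd, mode='full')[len(xd) - 1:] / len(xd)
    # normalize by the lag-0 autocovariance (the full-series variance)
    return acov[:nlags + 1] / acov[0]

x = np.linspace(0, 100, 101)
print(np.allclose(acf_via_correlate(x, 10), acf(x, nlags=10)))  # True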
As suggested in the comments, the difference can be reduced, but not completely resolved, by supplying unbiased=True to the statsmodels function. Using a random input:
import statistics
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf
DATA_LEN = 100
N_TESTS = 100
N_LAGS = 32
def test(unbiased):
    data = pd.Series(np.random.random(DATA_LEN))
    data_acf_1 = acf(data, unbiased=unbiased, nlags=N_LAGS)
    data_acf_2 = [data.autocorr(i) for i in range(N_LAGS+1)]
    # return difference between results
    return sum(abs(data_acf_1 - data_acf_2))

for value in (False, True):
    diffs = [test(value) for _ in range(N_TESTS)]
    print(value, statistics.mean(diffs))
Output:
False 0.464562410987
True 0.0820847168593
In the following example, the Pandas autocorr() function gives the expected results but the statsmodels acf() function does not.
Consider the following series:
import pandas as pd
s = pd.Series(range(10))
We expect perfect correlation between this series and any of its lagged versions, and this is indeed what we get with the autocorr() function:
[ s.autocorr(lag=i) for i in range(10) ]
# [0.9999999999999999, 1.0, 1.0, 1.0, 1.0, 0.9999999999999999, 1.0, 1.0, 0.9999999999999999, nan]
But using acf() we get a different result:
from statsmodels.tsa.stattools import acf
acf(s)
# [ 1. 0.7 0.41212121 0.14848485 -0.07878788
# -0.25757576 -0.37575758 -0.42121212 -0.38181818 -0.24545455]
If we try acf with adjusted=True, the result is even more unexpected, because for some lags the result is less than -1 (note that a correlation has to lie in [-1, 1]). This happens because the adjusted estimator divides the lag-k autocovariance by n-k instead of n before normalizing by the lag-0 value, so the ratio is not guaranteed to stay within [-1, 1].
acf(s, adjusted=True) # 'unbiased' is deprecated and 'adjusted' should be used instead
# [ 1. 0.77777778 0.51515152 0.21212121 -0.13131313
# -0.51515152 -0.93939394 -1.4040404 -1.90909091 -2.45454545]
