Feature scaling for the KMeans algorithm - Python

I know feature scaling is required for the KMeans algorithm defined under
sklearn.cluster.KMeans
My question is whether it needs to be done manually before using KMeans, or whether KMeans performs feature scaling automatically. If it is automatic, please show me where this is specified in the KMeans algorithm, as I am unable to find it in the documentation here:
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
By the way, people say that KMeans itself takes care of feature scaling.

If your variables are of incomparable units (e.g. height in cm and weight in kg) then you should standardize the variables, of course. Even if variables are of the same units but show quite different variances, it is still a good idea to standardize before K-means (or any other multivariate analysis). You see, K-means clustering is "isotropic" in all directions of space and therefore tends to produce more or less round (rather than elongated) clusters. In this situation, leaving variances unequal is equivalent to putting more weight on variables with greater variance, so clusters will tend to be separated along variables with greater variance.
Another thing worth remembering is that K-means clustering results are potentially sensitive to the order of objects in the data set.¹ A justified practice would be to run the analysis several times, randomizing the object order; then average the cluster centres of those runs and input those centres as initial ones for one final run of the analysis.
¹ Specifically, (1) some methods of centre initialization are sensitive to case order; (2) even when the initialization method isn't sensitive, results might sometimes depend on the order in which the initial centres are introduced to the program (in particular, when there are tied, equal distances within the data); (3) the so-called running-means version of the k-means algorithm is naturally sensitive to case order (in this version - which is not often used apart from perhaps online clustering - the recalculation of centroids takes place after each individual case is reassigned to another cluster).

As far as I know, K-means does not automatically perform feature scaling. Anyway, it's a simple process that requires just two additional lines of code. I would recommend using StandardScaler for feature scaling. Here is an example of how to do it:
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Load the example data
iris = datasets.load_iris()
X = iris.data

# Standardize the features to zero mean and unit variance
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Cluster the standardized data
# (the n_jobs parameter has been removed from KMeans in recent scikit-learn versions)
clt = KMeans(n_clusters=3, random_state=0)
model = clt.fit(X_std)
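If you need the cluster centres back in the original feature units afterwards, you can map them through the scaler; a small sketch building on the code above:
# The centres are learned in standardized space; inverse_transform returns them
# to the original units for interpretation.
centers_original_units = scaler.inverse_transform(model.cluster_centers_)
print(centers_original_units)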

Related

Calculate Silhouette coefficient for each sample in PySpark

I have a Spark ML pipeline in pyspark that looks like this:
from pyspark.ml import Pipeline
from pyspark.ml import clustering
from pyspark.ml.feature import StandardScaler, PCA

scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
pca = PCA(k=3, inputCol=scaler.getOutputCol(), outputCol="pca_output")  # k is required; 3 is just a placeholder
kmeans = clustering.KMeans(featuresCol=pca.getOutputCol(), seed=2014)   # cluster on the PCA output, not the raw features
pipeline = Pipeline(stages=[scaler, pca, kmeans])
After training the model, I want to get silhouette coefficients for each sample, just like sklearn's silhouette_samples function provides.
I know that I can use ClusteringEvaluator and generate a score for the whole dataset, but I want it per sample instead.
How can I achieve this efficiently in pyspark?
This has been explored before on Stack Overflow. What I would change about, and add to, that answer is that you can use LSH as part of Spark. LSH essentially does blind clustering with a reduced set of dimensions: it reduces the number of comparisons and allows you to specify a 'boundary' (density limit) for your clusters. It can be a good tool for enforcing the level of density you are interested in. You could run KMeans first and use the centroids as input to the approximate join, or, vice versa, use the approximate join to help you pick the number of KMeans clusters to look at.
I found this link helpful to understand the LSH.
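For illustration, here is a rough sketch of that LSH idea with Spark ML's BucketedRandomProjectionLSH (the DataFrame name, column names and parameter values are assumptions, not part of the original answer):
from pyspark.ml.feature import BucketedRandomProjectionLSH

lsh = BucketedRandomProjectionLSH(inputCol="scaled_features", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)
lsh_model = lsh.fit(scaled_df)   # scaled_df: a DataFrame with a "scaled_features" vector column

# Approximate self-join: only pairs closer than the chosen 'boundary' are kept
pairs = lsh_model.approxSimilarityJoin(scaled_df, scaled_df, threshold=1.5,
                                       distCol="euclidean_dist")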
All that said, you could partition the data by each KMeans cluster and then run silhouette on a sample of the partitions (via mapPartitions), then apply the sample score to the entire group. Here's a good explanation of how samples are taken, so you don't have to start from scratch. I would assume that really dense clusters would be under-scored by silhouette samples, so this may not be a perfect way of going about things, but it would still be informative.
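As a rough sketch of that sampling idea (this one simply collects a sample to the driver and scores it with sklearn's silhouette_samples rather than using mapPartitions; df and the column names are assumptions):
import numpy as np
from sklearn.metrics import silhouette_samples

model = pipeline.fit(df)                 # pipeline from the question; df holds a "features" column
scored = model.transform(df)

# Score a random sample and treat it as an estimate for the full data
sample_pdf = (scored.select("pca_output", "prediction")
                    .sample(fraction=0.1, seed=42)
                    .toPandas())

X = np.array([v.toArray() for v in sample_pdf["pca_output"]])
labels = sample_pdf["prediction"].to_numpy()
print(silhouette_samples(X, labels)[:10])   # silhouette coefficient per sampled row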

How to get different clusters using OPTICS in python by varying the parameter xi?

I am trying to fit an OPTICS clustering model to my data using Python's sklearn:
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.preprocessing import StandardScaler
x = StandardScaler().fit_transform(data.loc[:, features])
op = OPTICS(max_eps=20, min_samples=10, xi=0.1)
op = op.fit(x)
From this fitted model, I get the reachability distances (op.reachability_), the ordering of the points (op.ordering_), and the cluster labels (op.labels_).
Now, I want to check how the clusters vary when the parameter xi is changed (it is set to 0.1 in the code above). Can I do this without fitting the model again and again with different xi values (which takes a lot of time)?
Or, in other words, is there a scikit-learn function that takes the reachability distances (op.reachability_), the ordering (op.ordering_) of the points, and xi as input and outputs the cluster labels?
I found the function cluster_optics_dbscan, which "performs DBSCAN extraction for an arbitrary epsilon given reachability-distances, core-distances and ordering", but that is not quite what I want.
A priori, you need to call the fit method, which does the actual cluster computation, as stated in the function description.
However, if you look at the OPTICS class, the cluster_optics_xi function "automatically extract[s] clusters according to the Xi-steep method", calling both the _xi_cluster and _extract_xi_labels functions, which take the xi parameter as input. So, by using them and refactoring a bit, you may be able to achieve what you want.
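For instance, a minimal sketch (assuming a recent scikit-learn, where the xi extraction is also exposed publicly as sklearn.cluster.cluster_optics_xi) that re-extracts labels for a new xi without refitting:
from sklearn.cluster import cluster_optics_xi

new_labels, new_clusters = cluster_optics_xi(
    reachability=op.reachability_,   # reachability distances from the fitted model
    predecessor=op.predecessor_,     # predecessor array computed during fit
    ordering=op.ordering_,           # cluster ordering of the samples
    min_samples=10,                  # should match the value used when fitting
    xi=0.01,                         # the new xi value to try
)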

Scoring increasing with number of components using PCA

I recently started working in the field of machine learning using Python. Today I'm working on a dataset where I would like to apply dimension reduction and then evaluate my model's score. This dataset has 30 features.
I start with a simple algorithm, logistic regression, but before applying my logistic regression I want to do a PCA.
To determine the best number of components I used GridSearchCV, tuning only the C parameter of my logistic regression and the number of components of my PCA.
The result I got is that the more components I use for my PCA, the better the precision score. For example, with n_components=30 I get a precision score of 0.81.
The problem is that I thought PCA was used for dimension reduction (i.e. working with fewer features) and that it could help increase the score. Is there something I do not understand?
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import pandas as pd

pca = PCA()
logistic = LogisticRegression()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

# Grid over the number of PCA components and the regularization strength C
param_grid = {
    'pca__n_components': [5, 10, 15, 20, 25, 30],
    'logistic__C': [0.01, 0.1, 1, 10, 100]
}

search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1, scoring='precision')
search.fit(X_train, y_train)

print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
results = pd.DataFrame(search.cv_results_)
Output:
Best parameter (CV score=0.881):
{'logistic__C': 0.01, 'pca__n_components': 30}
Thanks in advance for your reply.
EDIT: I added this screenshot for more information on the score as a function of the number of components.
In general, when you do dimension reduction, you lose some information. It is not surprising, then, that you get a higher score with the full set of PCA features. Working with fewer features can indeed help increase the score, but not necessarily; there are also other good reasons for using PCA for dimension reduction. Here are the main advantages of PCA:
PCA is a good technique for dimension reduction (with its own limitations) in the sense that it concentrates the variance of the dataset in the first dimensions of the computed new space. Hence, dropping the last features is done at a minimal cost in terms of information carried by the dataset (under certain hypotheses). Using PCA for dimension reduction mitigates the risk of overfitting by limiting the number of features while losing a minimal amount of information. In this sense, fewer features can increase the score by avoiding overfitting, but that is not always true.
Dimension reduction with PCA can also be useful when working with noisy data. PCA will not directly eliminate the noise, but the first few features will have a higher signal-to-noise ratio since the variance of the dataset is concentrated there. The last features may then be dominated by noise and dropped.
Since PCA projects the dataset onto a new orthonormal basis, the new features will all be linearly uncorrelated with each other. This property is often desired by machine learning algorithms to achieve optimal performance.
Of course, PCA should not be used in every case, as it has its own hypotheses and limitations. Here are what I consider the main ones (non-exhaustive):
PCA is sensitive to the scaling of the variables. For example, if you have a temperature column in your dataset, you will get a different transformation depending on whether you use Celsius or Fahrenheit as the unit, because their scales are different. When the variables have different scales, PCA is somewhat arbitrary. This can be corrected by scaling all variables to unit variance, but at the cost of modifying (compressing or expanding) the fluctuations of the variables in all dimensions.
PCA captures linear correlations between the features but fails to capture non-linear correlations.
What would be interesting in your case would be to compare the score obtained with and without the PCA transformation; you would then see whether there is a benefit in using it.
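A minimal sketch of that comparison (reusing X_train and y_train from the question; n_components=20 and max_iter are arbitrary choices for illustration):
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

with_pca = Pipeline([('pca', PCA(n_components=20)),
                     ('logistic', LogisticRegression(max_iter=1000))])
without_pca = LogisticRegression(max_iter=1000)

print("with PCA   :", cross_val_score(with_pca, X_train, y_train, cv=5, scoring='precision').mean())
print("without PCA:", cross_val_score(without_pca, X_train, y_train, cv=5, scoring='precision').mean())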
Last but not least, your plot shows an interesting thing: the gain in the score between 20 and 30 components is very low (1%?). You can wonder whether it is worth keeping ten additional components for this very low gain. Indeed, keeping more features increases the risk of having a model with a lower ability to generalize. Cross-validation already mitigates this risk, but there is no guarantee that when you apply the model to unseen data, that data will have exactly the same properties as your training dataset.

Principal Component Analysis (PCA) vs. Extra Tree Classifier for Data Reduction

I have a dataset that consists of 13 columns, and I wanted to use PCA for data reduction to remove unwanted columns. My problem is that PCA doesn't really show column names, but only PC1, PC2, etc. I found out that the Extra Trees classifier does something similar but does indicate the importance of each column. I just wanted to make sure whether they both have the same objective, or whether they differ in their outcome. Also, would anyone suggest better methods for data reduction?
My last question is that I have code for an Extra Trees classifier and wanted to confirm whether it is correct or not:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.ensemble import ExtraTreesClassifier

df = pd.read_csv('D:\\Project\\database\\5-FINAL2\\Final After Simple Filtering.csv')

# ExtraTreesClassifier is supervised, so it needs features X and a target y;
# 'target' below is a placeholder for whatever the label column is called.
X = df.drop(columns=['target'])
y = df['target']

extra_tree_forest = ExtraTreesClassifier(n_estimators=500, criterion='entropy')
extra_tree_forest.fit(X, y)

# Mean importance of each feature across the whole forest
feature_importance = extra_tree_forest.feature_importances_
# Spread of the importances across the individual trees (useful as error bars,
# not as a normalized importance)
feature_importance_std = np.std(
    [tree.feature_importances_ for tree in extra_tree_forest.estimators_], axis=0)

plt.bar(X.columns, feature_importance, yerr=feature_importance_std)
plt.xlabel('Feature Labels')
plt.ylabel('Feature Importances')
plt.title('Comparison of different Feature Importances')
plt.show()
Thank You.
The two methods are very different.
PCA doesn't show you the feature names because dimensionality reduction with PCA doesn't really have anything to do with the relative importance of the features. PCA takes the original data and transforms it into a space where each new 'feature' (principal component) is independent of the others, and you can tell how important each principal component is to faithfully representing the data based on its corresponding eigenvalue. Removing the least important principal components reduces dimensionality in principal component space, but not in the original feature space - so you need to do PCA on all future data, too, and then perform all your classification on the (shortened) principal component vectors.
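If you do want to see how each original column contributes to the principal components, here is a small illustrative sketch (not part of the original answer; it assumes a pandas DataFrame features_df holding only the numeric feature columns):
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(features_df)   # scale first; PCA is scale-sensitive
pca = PCA().fit(X_scaled)

# Loadings: rows are principal components, columns are the original feature names
loadings = pd.DataFrame(pca.components_, columns=features_df.columns,
                        index=[f"PC{i + 1}" for i in range(pca.n_components_)])
print(pca.explained_variance_ratio_)                      # variance captured by each component
print(loadings.head())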
An extra tree classifier trains an entire classifier on your data, so it's much more powerful than just dimensionality reduction. However, it does seem closer to what you're looking for, since the feature importance does directly tell you how relevant each feature is when making a classification.
Note that in PCA, the principal components with the highest eigenvalues contribute the most to accurately reconstructing the data. This is not the same as contributing the most to accurately classifying the data. The extra tree classifier is the reverse: it tells you what features are most important when classifying the data, not when reconstructing it.
Basically, if you think you have a representative dataset right now and are comfortable only storing variables that are relevant to classifying the data you already have, dimensionality reduction with extra trees is a good choice for you. If you just want to faithfully represent the data with less space without being overly concerned about the effects on classification, PCA is the better choice. Dimensionality reduction with PCA will often also help remove irrelevant features from the original data, but that's not what it's optimized for.

k-fold Cross Validation for determining k in k-means?

In a document clustering process, as a data pre-processing step, I first applied singular value decomposition to obtain U, S and Vt, and then, by choosing a suitable number of singular values, I truncated Vt, which now gives me a good document-document correlation, from what I read here. Now I am performing clustering on the columns of the matrix Vt to cluster similar documents together, and for this I chose k-means; the initial results looked acceptable to me (with k = 10 clusters), but I wanted to dig a bit deeper into choosing the value of k itself. To determine the number of clusters k in k-means, I was suggested to look at cross-validation.
Before implementing it, I wanted to figure out whether there is a built-in way to achieve this using numpy or scipy. Currently, I am performing k-means by simply using the function from scipy:
import numpy
from scipy.linalg import svd
from scipy.cluster.vq import whiten, kmeans2

# Preprocess the data and compute the SVD
U, S, Vt = svd(A)  # A is the TF-IDF representation of the original term-document matrix

# Obtain the document-document correlations from Vt.
# The 50 below is the threshold obtained after examining a scree plot of S.
docvectors = numpy.transpose(Vt[0:50, :])

# Prepare the data and run k-means
whitened = whiten(docvectors)
res, idx = kmeans2(whitened, 10, iter=20)
Assuming my methodology is correct so far (please correct me if I am missing some step), at this stage, what is the standard way of using the output to perform cross-validation? Any references/implementations/suggestions on how this would be applied to k-means would be greatly appreciated.
To run k-fold cross-validation, you'd need some measure of quality to optimize for. This could be either a classification measure such as accuracy or F1, or a specialized one such as the V-measure.
Even the clustering quality measures that I know of need a labeled dataset ("ground truth") to work; the difference from classification is that you only need part of your data to be labeled for the evaluation, while the k-means algorithm can make use of all the data to determine the centroids and thus the clusters.
V-measure and several other scores are implemented in scikit-learn, as is generic cross-validation code and a "grid search" module that optimizes according to a specified evaluation measure using k-fold CV. Disclaimer: I'm involved in scikit-learn development, though I didn't write any of the code mentioned.
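For example, a minimal sketch of that kind of evaluation with scikit-learn (true_labels is a hypothetical array of ground-truth classes for a labeled subset of the documents; docvectors comes from the question):
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

labels_pred = KMeans(n_clusters=10, random_state=0).fit_predict(docvectors)

# Suppose only the first 200 documents have ground-truth labels
print(v_measure_score(true_labels[:200], labels_pred[:200]))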
Indeed, to do traditional cross-validation with the F1-score or V-measure as the scoring function, you would need some labeled data as ground truth. But in that case, you could just count the number of classes in the ground-truth dataset and use it as your optimal value for K, hence there is no need for cross-validation.
Alternatively, you could use a cluster stability measure as an unsupervised performance evaluation and do some kind of cross-validation procedure with it. However, this is not yet implemented in scikit-learn, even though it's still on my personal todo list.
You can find additional info on this approach in the following answer on metaoptimize.com/qa. In particular, you should read "Clustering Stability: An Overview" by Ulrike von Luxburg.
Here they use withinss to find an optimal number of clusters. "withinss" is an attribute of the kmeans object returned in R; it can be used to find a minimum "error":
https://www.statmethods.net/advstats/cluster.html
wss <- (nrow(mydata)-1)*sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")
This formula isn't exactly it, but I'm working on one myself. The model would still change every time, but it would at least be the best model out of a bunch of iterations.
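For the Python/scipy setup in the question, the same elbow idea can be sketched with scikit-learn's inertia_, which is the within-cluster sum of squares (whitened comes from the question's code):
from sklearn.cluster import KMeans

wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(whitened).inertia_
       for k in range(2, 16)]
# Look for the "elbow" where wss stops dropping sharply as k increases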
