k-fold Cross Validation for determining k in k-means? - python

In a document clustering process, as a data pre-processing step, I first applied singular value decomposition to obtain U, S and Vt and then, by choosing a suitable number of singular values, I truncated Vt, which now gives me a good document-document correlation from what I read here. Now I am performing clustering on the columns of the matrix Vt to cluster similar documents together. For this I chose k-means, and the initial results looked acceptable to me (with k = 10 clusters), but I wanted to dig a bit deeper on choosing the k value itself. To determine the number of clusters k in k-means, it was suggested that I look at cross-validation.
Before implementing it I wanted to figure out if there is a built-in way to achieve it using numpy or scipy. Currently, the way I am performing kmeans is to simply use the function from scipy.
import numpy
from scipy.linalg import svd
from scipy.cluster.vq import whiten, kmeans2
# Preprocess the data and compute svd
U, S, Vt = svd(A) # A is the TFIDF representation of the original term-document matrix
# Obtain the document-document correlations from Vt
# This 50 is the threshold obtained after examining a scree plot of S
docvectors = numpy.transpose(Vt[0:50, 0:])
# Prepare the data to run k-means
whitened = whiten(docvectors)
res, idx = kmeans2(whitened, 10, iter=20)
Assuming my methodology is correct so far (please correct me if I am missing some step), at this stage, what is the standard way of using the output to perform cross-validation? Any reference/implementations/suggestions on how this would be applied to k-means would be greatly appreciated.

To run k-fold cross validation, you'd need some measure of quality to optimize for. This could be either a classification measure such as accuracy or F1, or a specialized one such as the V-measure.
Even the clustering quality measures that I know of need a labeled dataset ("ground truth") to work; the difference with classification is that you only need part of your data to be labeled for the evaluation, while the k-means algorithm can make use of all the data to determine the centroids and thus the clusters.
V-measure and several other scores are implemented in scikit-learn, as well as generic cross validation code and a "grid search" module that optimizes according to a specified measure of evaluation using k-fold CV. Disclaimer: I'm involved in scikit-learn development, though I didn't write any of the code mentioned.
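For illustration, here is a minimal sketch of that idea, assuming a labeled dataset is available. The iris data is only a stand-in for a labeled subset of the documents, and the range of k values and the number of folds are arbitrary choices, not recommendations.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import v_measure_score
from sklearn.model_selection import KFold

# Stand-in labeled data (assumption: real use would pass document vectors and labels)
X, y_true = load_iris(return_X_y=True)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
mean_scores = {}
for k in range(2, 7):
    fold_scores = []
    for train_idx, test_idx in kf.split(X):
        # Fit the centroids on the training fold only (labels are not used here)
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[train_idx])
        # Score the held-out fold against its ground-truth labels
        predicted = km.predict(X[test_idx])
        fold_scores.append(v_measure_score(y_true[test_idx], predicted))
    mean_scores[k] = sum(fold_scores) / len(fold_scores)

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_k)
```

The labels are only needed for scoring, which matches the point above that only part of the data has to be labeled.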

Indeed, to do traditional cross validation with F1-score or V-Measure as the scoring function you would need some labeled data as ground truth. But in this case you could just count the number of classes in the ground truth dataset and use it as your optimal value for K, hence no need for cross-validation.
Alternatively you could use a cluster stability measure as unsupervised performance evaluation and do some kind of cross validation procedure for that. However this is not yet implemented in scikit-learn even though it's still on my personal todo list.
You can find additional info on this approach in the following answer on metaoptimize.com/qa. In particular you should read Clustering Stability: An Overview by Ulrike von Luxburg.
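To give a rough idea of what a stability check can look like, here is a sketch of one simple variant (not the specific procedures discussed in the paper): cluster two random subsamples, label the full data with both models, and compare the two labelings with the adjusted Rand index. The function name `stability`, the subsample fraction, and the synthetic blobs are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

def stability(X, k, n_pairs=5, frac=0.8, seed=0):
    """Mean agreement between k-means models fitted on random subsamples."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    size = int(frac * n)
    scores = []
    for _ in range(n_pairs):
        idx_a = rng.choice(n, size=size, replace=False)
        idx_b = rng.choice(n, size=size, replace=False)
        km_a = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[idx_a])
        km_b = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[idx_b])
        # Compare how the two models partition the full dataset
        scores.append(adjusted_rand_score(km_a.predict(X), km_b.predict(X)))
    return float(np.mean(scores))

# Synthetic data with three well-separated groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [0, 6]],
                  cluster_std=0.5, random_state=0)
stab = {k: stability(X, k) for k in range(2, 7)}
print(stab)
```

A value of k whose clusterings agree across subsamples is considered more stable; no labels are needed for this.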

Here they use withinss to find an optimal number of clusters. "withinss" is an attribute of the kmeans object returned. That could be used to find a minimum "error":
https://www.statmethods.net/advstats/cluster.html
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata,
                                     centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")
This formula isn't exactly it. But I'm working on one myself. The model would still change every time, but it would at least be the best model out of a bunch of iterations.
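Since the question is about Python, a rough equivalent of the R snippet can be sketched with scikit-learn, where the fitted model's `inertia_` attribute is the total within-cluster sum of squares. The synthetic data and the range of k are assumptions for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [0, 6]],
                  cluster_std=0.5, random_state=0)

wss = {}
for k in range(1, 11):
    # n_init=10 keeps the best of ten random initializations
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss[k] = km.inertia_

for k, value in wss.items():
    print(k, round(value, 1))
```

The within-cluster sum of squares keeps shrinking as k grows, so one usually looks for the point where the curve flattens rather than for its minimum.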

Related

Calculate Silhouette coefficient for each sample in PySpark

I have a Spark ML pipeline in pyspark that looks like this,
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
pca = PCA(inputCol=scaler.getOutputCol(), outputCol="pca_output")
kmeans = clustering.KMeans(seed=2014)
pipeline = Pipeline(stages=[scaler, pca, kmeans])
After training the model, I wanted to get silhouette coefficients for each sample just like this function in sklearn
I know that I can use ClusteringEvaluator and generate scores for the whole dataset. But I want to do it for each sample instead.
How can I achieve this efficiently in pyspark?
This has been explored before on Stack Overflow. What I would change about that answer, and add to it, is that you can use LSH, which is part of Spark. This essentially does blind clustering with a reduced set of dimensions. It reduces the number of comparisons and allows you to specify a 'boundary' (density limit) for your clusters. It could be used as a good tool to enforce a level of density that you are interested in. You could run KMeans first and use the centroids as input to the approximate join, or vice versa, to help you pick the number of k-means points to look at.
I found this link helpful to understand the LSH.
All that said, you could partition the data by each k-means cluster and then run silhouette on a sample of the partitions (via mapPartitions). Then apply the sample score to the entire group. Here's a good explanation of how samples are taken so you don't have to start from scratch. I would assume that really dense clusters would be underscored by silhouette samples, so this may not be a perfect way of going about things, but it would still be informative.
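As a sketch of the per-sample computation itself, here is a plain NumPy version that could be applied to a sample collected to the driver. It is not a distributed implementation, it builds the full pairwise distance matrix (so it only suits small samples), and the function name `silhouette_per_sample` is made up for the example.

```python
import numpy as np

def silhouette_per_sample(X, labels):
    """Silhouette s(i) = (b - a) / max(a, b) for each row of X.

    Assumes at least two distinct cluster labels are present.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise Euclidean distance matrix
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    clusters = np.unique(labels)
    scores = np.zeros(X.shape[0])
    for i in range(X.shape[0]):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # a singleton cluster gets a score of 0 by convention
        # a: mean distance to the other members of the same cluster
        a = dist[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

points = [[0, 0], [0, 1], [10, 0], [10, 1]]
cluster_ids = [0, 0, 1, 1]
sample_scores = silhouette_per_sample(points, cluster_ids)
print(sample_scores)
```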

Feature scaling for Kmeans algorithm

I know feature scaling is required for KMeans algorithm defined under
sklearn.cluster.KMeans
My question is whether it needs to be done manually before using KMeans, or does KMeans perform feature scaling automatically? If automatic, please show me where it is specified in the KMeans algorithm, as I am unable to find it in the documentation here:
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Btw people say that Kmeans itself takes care of Feature Scaling.
If your variables are of incomparable units (e.g. height in cm and weight in kg) then you should standardize variables, of course. Even if variables are of the same units but show quite different variances, it is still a good idea to standardize before K-means. You see, K-means clustering is "isotropic" in all directions of space and therefore tends to produce more or less round (rather than elongated) clusters. In this situation, leaving variances unequal is equivalent to putting more weight on variables with greater variance, because they dominate the Euclidean distance, so clusters will tend to be separated along those variables.
Another thing worth remembering is that K-means clustering results are potentially sensitive to the order of objects in the data set [1]. A justified practice would be to run the analysis several times, randomizing the order of the objects; then average the cluster centres of those runs and input the centres as initial ones for one final run of the analysis.
[1] Specifically, (1) some methods of centres initialization are sensitive to case order; (2) even when the initialization method isn't sensitive, results might sometimes depend on the order in which the initial centres are introduced to the program (in particular, when there are tied, equal distances within the data); (3) the so-called running means version of the k-means algorithm is naturally sensitive to case order (in this version, which is not often used apart from maybe online clustering, recalculation of centroids takes place after each individual case is re-assigned to another cluster).
As far as I know, K-means does not automatically perform feature scaling. Anyway, it's a simple process and requires just two additional lines of code. I would recommend using StandardScaler for feature scaling. Here is a good example of how to do it.
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
iris = datasets.load_iris()
X = iris.data
# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
# Recent scikit-learn releases no longer accept n_jobs for KMeans, so it is not passed
clt = KMeans(n_clusters=3, random_state=0)
model = clt.fit(X_std)
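One possible variation, sketched here as an option rather than a requirement, is to chain the scaler and KMeans in a scikit-learn pipeline so that any data passed to `predict` later goes through the same scaling that was fitted on the training data. The choice of three clusters simply mirrors the iris example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# The scaler is fitted once and reused automatically at predict time
pipe = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipe.fit_predict(X)

# New rows are scaled with the stored mean and variance before assignment
new_labels = pipe.predict(X[:5])
print(labels[:5], new_labels)
```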

what follows after clustering

I am trying to cluster images based on their similarities with SIFT and Affinity Propagation, I did the clustering but I just don't want to visualize the results. How can I test with a random image from the obtained labels? Or maybe there's more to it?
Other than data visualization, I just don't know what follows after clustering. How do I verify the 'clustering'?
Since clustering is unsupervised, there isn't an objective way to evaluate it. Typically, you just observe and see if there are some features that characterize a certain cluster.
If you have ground-truth cluster labels, you can measure the Jaccard index or something along those lines to get an error score. Then, you can tweak your distance measure or parameters etc. to minimize the error score.
You can also do some clustering in order to group your data as the divide step in divide-and-conquer algorithms/applications.
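For the ground-truth comparison mentioned above, one common way to apply the Jaccard index to clusterings is to count pairs of items: a pair "agrees" when both labelings put the two items in the same cluster. The sketch below uses only the standard library; the function name `pair_jaccard` is made up for the example, and higher values mean better agreement (so 1 minus the value can serve as an error score).

```python
from itertools import combinations

def pair_jaccard(labels_true, labels_pred):
    """Pair-counting Jaccard index between two labelings of the same items."""
    both = 0    # pairs grouped together in both labelings
    either = 0  # pairs grouped together in at least one labeling
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            both += 1
        if same_true or same_pred:
            either += 1
    # If no pair is grouped in either labeling, treat them as identical
    return both / either if either else 1.0

print(pair_jaccard([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, renamed labels
print(pair_jaccard([0, 0, 0, 1], [0, 0, 1, 1]))
```

Because it works on pairs, the score does not depend on how the cluster ids happen to be numbered.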

What is the difference between a linear regression classifier and linear regression to extract the confidence interval?

I am a beginner with machine learning. I want to use time series linear regression to extract the confidence interval of my dataset. I don't need to use the linear regression as a classifier. Firstly, what is the difference between the two cases? Secondly, in Python, is there a different way to implement them?
The main difference is that a classifier will compute a probability for a label, while a regression will compute a quantitative output.
Generally, a classifier is used to compute the probability of a label, and a regression is often used to compute a quantity. For instance, if you want to compute the price of a flat given some criteria you will use a regression; if you want to compute a label (luxurious, modest, ...) for the same flat given some criteria you will use a classifier.
But using a regression to compute a threshold that separates the observed labels is also a commonly used technique. That is the case of the linear SVM, which computes a boundary between labels. It is called the decision boundary. Warning: the main drawback of a linear model is that it is linear, which means the boundary will necessarily be a straight line separating the labels. Sometimes that is good enough, sometimes it is not.
Logistic regression is an exception because it actually computes a probability. Its name is misleading.
For regression, when you want to compute a quantitative output, you can use a confidence interval to get an idea of the error. In classification there is no confidence interval; even if you use a linear SVM, it is nonsensical. You can use the decision function, but it is difficult to interpret in practice, or use the predicted probabilities, count how often the label is wrong and compute an error ratio. There are plenty of ratios available depending on your problem, and it is, bluntly, the subject of a whole book.
Anyway, if you're working with a time series, as far as I know your goal is to obtain a quantitative output, so you do not need a classifier, as you said. As for extracting the interval, it depends entirely on the object you used to compute it in Python, meaning it depends on the attributes available on that object, and therefore on the library too. So it would be much easier to answer if you indicated which libraries and objects you are using.
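As one concrete illustration of "it depends on the library", here is a sketch that assumes SciPy is being used: `scipy.stats.linregress` exposes the standard error of the slope, from which a 95% confidence interval can be built with the t distribution. The synthetic series (true slope 2, Gaussian noise) is made up for the example.

```python
import numpy as np
from scipy import stats

# Synthetic time series: y = 2 * t + 1 plus noise (illustrative only)
rng = np.random.default_rng(0)
t = np.arange(50, dtype=float)
y = 2.0 * t + 1.0 + rng.normal(scale=3.0, size=t.size)

res = stats.linregress(t, y)

# Two-sided 95% interval for the slope, with n - 2 degrees of freedom
dof = t.size - 2
t_crit = stats.t.ppf(0.975, dof)
slope_ci = (res.slope - t_crit * res.stderr, res.slope + t_crit * res.stderr)
print(res.slope, slope_ci)
```

Note that this interval relies on the usual ordinary least squares assumptions; with autocorrelated time series residuals it may be too narrow.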

How to know the factor by which a feature affects a model's prediction

I have trained my model on a data set using decision trees. It has 3 output classes (Yes, Done and No), and I found the features that are most decisive in making a decision by checking the feature importance of the classifier. I am using Python and sklearn as my ML library. Now that I have found the most decisive feature, I would like to know how that feature contributes: whether the relation is positive, so that an increase in the feature value leads to Yes, or negative, so that it leads to No, and so on. I would also like to know the magnitude of that effect.
I would like to know if there a solution to this and also would to know a solution that is independent of the algorithm of choice, Please try to provide solutions that are not specific to decision tree but rather general solution for all the algorithms.
If there is some way that would tell me like:
for feature x1 the relation is 0.8*x1^2
for feature x2 the relation is -0.4*x2
just so that I would be able to analyse how the output depends on the input features x1, x2 and so on.
Is it possible to find out whether a high value of a particular feature leads to a certain class, or a low value of the feature?
You can use Partial Dependence Plots (PDPs). scikit-learn has a built-in PDP for its GBM - http://scikit-learn.org/stable/modules/ensemble.html#partial-dependence - based on the method described in Friedman's Greedy Function Approximation paper http://statweb.stanford.edu/~jhf/ftp/trebst.pdf, pp. 26-28.
If you used scikit-learn GBM, use their PDP function. If you used another estimator, you can create your own PDP which is a few lines of code. PDPs and this method is algorithm agnostic as you asked. It just will not scale.
Logic
1. Take your training data
2. For the feature you are examining, get all unique values or some quantiles to reduce the time
3. Take a unique value
4. For the feature you are examining, in all observations, replace with the value from (3)
5. Predict all training observations
6. Get the mean of all predictions
7. Plot the point (unique value, mean)
8. Repeat 3-7 taking the next unique value until no more values
You now have a 1-way PDP. When the feature increases (X-axis), what on average happens to the prediction (y-axis). What is the magnitude of the change.
Taking the analysis further, you can fit a smooth curve or splines to the PDP, which may help in understanding the relationship. As @Maxim said, there is no perfect rule, so you are looking for the trend here, trying to understand a relationship. We tend to run this for the most important features and/or features you are curious about.
The above scikit-learn reference has more examples.
For a Decision Tree, you can use the algorithmic short-cut as described by Friedman and implemented by scikit-learn. You need to walk the tree so the code is tied to the package and algorithm, hence it does not answer your question and I will not describe it. But it is on that scikit-learn page I referenced and in the paper.
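import numpy as np

def pdp_data(clf, X, col_index):
    # Work on a copy so the caller's data is left untouched
    X_copy = np.copy(X)
    results = {}
    # np.unique already returns the values sorted
    results['x_values'] = np.unique(X_copy[:, col_index])
    results['y_values'] = []
    for value in results['x_values']:
        # Set the examined feature to the same value in every row
        X_copy[:, col_index] = value
        # Log-probability of the class in column 1 of clf.classes_
        y_predict = clf.predict_log_proba(X_copy)[:, 1]
        results['y_values'].append(np.mean(y_predict))
    return results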
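To illustrate the idea of fitting a curve to the PDP, here is a self-contained sketch using a toy model whose true relation is known, so the fitted coefficients can be checked. `ToyModel` and `partial_dependence_curve` are hypothetical names for this example, and the quadratic fit is just one possible choice of smooth curve.

```python
import numpy as np

class ToyModel:
    """Hypothetical stand-in for a fitted estimator with a predict method."""
    def predict(self, X):
        return 0.8 * X[:, 0] ** 2 - 0.4 * X[:, 1]

def partial_dependence_curve(model, X, col_index, grid):
    """Mean prediction when one feature is forced to each grid value."""
    X_copy = np.array(X, dtype=float)
    means = []
    for value in grid:
        X_copy[:, col_index] = value
        means.append(model.predict(X_copy).mean())
    return np.array(means)

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
grid = np.linspace(-2, 2, 21)

pd_values = partial_dependence_curve(ToyModel(), X, 0, grid)

# Fit a degree-2 polynomial to the PDP; highest power comes first in coeffs
coeffs = np.polyfit(grid, pd_values, deg=2)
print(coeffs)
```

Here the leading coefficient recovers the 0.8 on the squared term; with a real model the fit only summarizes the average trend and should be read as such.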
Edited to answer new part of question:
For the addition to your question, you are looking for a linear model with coefficients. If you must interpret the model with linear coefficients, build a linear model.
Sometimes how you need to interpret the model guides what type of model you build.
In general - no. Decision trees work differently than that. For example, a tree could have a rule under the hood that if feature X > 100 OR X < 10 and Y = 'some value' then the answer is Yes, if 50 < X < 70 the answer is No, etc. In the case of a decision tree you may want to visualize its results and analyse the rules. With an RF model it is not possible, as far as I know, since you have a lot of trees working under the hood, each with independent decision rules.
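For inspecting the rules of a single tree, scikit-learn can print them as text with `export_text`. The sketch below uses the iris data and a shallow tree purely as an example; the depth limit is an arbitrary choice to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# One line per split or leaf, indented by depth
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Reading the thresholds along each path shows which ranges of a feature lead to which class, which is as close as a tree gets to a "direction" for a feature.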
