Using K-Means with predefined centers? - python

I'm running a KNN classifier whose feature vectors come from a K-Means clusterer (more specifically, sklearn.cluster.MiniBatchKMeans). Since the K-means starts from random points, I get different results on every run of my algorithm. I've stored the cluster centers in a separate .npy file from a run where the results were good, but now I need to use those centers in my K-means and I don't know how.
Following this advice, I tried to use the cluster centers as starting points like so:
MiniBatchKMeans.__init__(self, n_clusters=self.clusters, n_init=1, init=np.load('cluster_centers.npy'))
Still, results change every time the algorithm is run.
Then I tried to manually alter the cluster centers after fitting the data:
kMeansInstance.cluster_centers_ = np.load('cluster_centers.npy')
Still, different results each time.
The only other solution I can think of is manually implementing the predict method using the centers I saved, but I don't know how, and I don't know if there is a better way to solve my problem than reinventing the wheel.

I would guess fixing the random_state will do the job.
See the API documentation.

Mini batch k-means only considers a sample of the data.
It uses a random generator for this.
If you want deterministic behaviour, fix the random seed, and prefer algorithms that do not use a random sample (i.e., use the regular k-means instead of mini-batch k-means).
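A minimal sketch of that, assuming the centers saved in cluster_centers.npy from the question and some training data X (a placeholder); with an explicit init array, n_init=1, and a fixed random_state, the mini-batch sampling is reproducible:
import numpy as np
from sklearn.cluster import MiniBatchKMeans
centers = np.load('cluster_centers.npy')
# Explicit starting centers, a single initialization, and a fixed seed so the
# random mini-batch sampling is the same on every run.
mbk = MiniBatchKMeans(n_clusters=centers.shape[0], init=centers, n_init=1, random_state=42)
mbk.fit(X)              # X is the training data (placeholder)
labels = mbk.predict(X)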

Related

scikit-learn: fitting KNeighborsClassifier without labels

I'm trying to fit a simple KNN classifier, and wanted to use the scikit-learn implementation in order to benefit from their efficient implementation (multiprocessing, tree-based algorithms).
However, what I want to get as a result is just the list of distances and nearest neighbours for each data point, rather than the predicted label.
I will then compute the label separately in a non-standard way.
The kneighbors method seems to be exactly what I need; however, I cannot call it without fitting the model with fit first. The issue is that fit() requires the labels (y) as a parameter.
Is there a way to achieve what I'm after? Perhaps I can pass fake labels to the fit() method - are there any issues I'm missing by doing this? E.g., is this going to affect the results (the computed distances and list of nearest neighbours for each data point) in any way? I wouldn't expect so, but I'm not familiar with the inner workings of the scikit-learn implementation.
There is another algorithm for what you desire: NearestNeighbors.
This estimator is unsupervised (you don't need the y labels); moreover, it has a method (kneighbors) that returns the distances to the nearest points and which samples they are.
Check the link, it is quite clear.
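A minimal sketch of that, with a small random X standing in for your data and 5 neighbours as an arbitrary choice:
import numpy as np
from sklearn.neighbors import NearestNeighbors
X = np.random.rand(100, 3)          # placeholder data
# No labels are needed: NearestNeighbors is unsupervised.
nn = NearestNeighbors(n_neighbors=5).fit(X)
# distances[i] and indices[i] describe the 5 nearest neighbours of X[i]
# (when querying the training data itself, each point appears as its own
# first neighbour at distance 0).
distances, indices = nn.kneighbors(X)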

Calculate Silhouette coefficient for each sample in PySpark

I have a Spark ML pipeline in pyspark that looks like this,
from pyspark.ml import Pipeline, clustering
from pyspark.ml.feature import StandardScaler, PCA
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
pca = PCA(inputCol=scaler.getOutputCol(), outputCol="pca_output")
kmeans = clustering.KMeans(featuresCol=pca.getOutputCol(), seed=2014)  # cluster on the PCA output
pipeline = Pipeline(stages=[scaler, pca, kmeans])
After training the model, I wanted to get silhouette coefficients for each sample just like this function in sklearn
I know that I can use ClusteringEvaluator and generate scores for the whole dataset. But I want to do it for each sample instead.
How can I achieve this efficiently in pyspark?
This has been explored before on Stack Overflow. What I would change about and add to that answer is that you can use LSH (locality-sensitive hashing), which is part of Spark. It essentially does blind clustering in a reduced set of dimensions: it cuts down the number of comparisons and lets you specify a 'boundary' (density limit) for your clusters, so it can be a good tool for enforcing the level of density you are interested in. You could run KMeans first and use the centroids as input to the approximate join, or, vice versa, use the join to help you pick the number of KMeans clusters to look at.
I found this link helpful to understand the LSH.
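To make the KMeans-plus-approximate-join idea concrete, here is a rough sketch under some assumptions: pipeline_model is the fitted pipeline from the question, df is the input DataFrame, spark is your SparkSession, and the bucketLength/threshold values are placeholders you would tune:
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors
predictions = pipeline_model.transform(df)
kmeans_model = pipeline_model.stages[-1]        # the fitted KMeansModel
# Build a small DataFrame holding the k-means centroids.
centroids_df = spark.createDataFrame(
    [(i, Vectors.dense(c)) for i, c in enumerate(kmeans_model.clusterCenters())],
    ["cluster_id", "pca_output"])
# LSH on the same feature column; bucketLength and the join threshold act as the 'boundary'.
lsh = BucketedRandomProjectionLSH(inputCol="pca_output", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)
lsh_model = lsh.fit(predictions)
# Approximate join: each point is paired only with centroids within the distance threshold.
joined = lsh_model.approxSimilarityJoin(predictions, centroids_df, threshold=1.5, distCol="dist")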
All that said, you could partition the data by k-means cluster and then run silhouette on a sample of each partition (via mapPartitions), then apply the sample score to the entire group. Here's a good explanation of how samples are taken so you don't have to start from scratch. I would assume that really dense clusters would be under-scored by silhouette samples, so this may not be a perfect way of going about things, but it would still be informative.
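A minimal sketch of that sampling idea, assuming pipeline_model is the fitted pipeline and df the input DataFrame (both placeholders), using sklearn's silhouette_samples on a driver-side sample:
import numpy as np
from sklearn.metrics import silhouette_samples
predictions = pipeline_model.transform(df)
# Pull a sample small enough to fit in driver memory.
sample_pdf = (predictions
              .select("pca_output", "prediction")
              .sample(fraction=0.05, seed=42)
              .toPandas())
X = np.array(sample_pdf["pca_output"].apply(lambda v: v.toArray()).tolist())
labels = sample_pdf["prediction"].to_numpy()
# Per-sample silhouette on the sample; average per cluster and use that average
# as the estimate for every member of the cluster.
scores = silhouette_samples(X, labels)
for cluster_id in np.unique(labels):
    print(cluster_id, scores[labels == cluster_id].mean())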

DecisionTreeClassifier's fit() returns different trees with same data

I have been playing around with sklearn a bit and following some simple examples online using the iris data.
I've now begun to play with some other data. I'm not sure if this behaviour is correct and I'm just misunderstanding, but every time I call fit(x, y) I get a completely different tree. So when I then run predictions I get varying accuracies (differing by around 10%), i.e. 60%, then 70%, then 65%, etc.
I ran the code below twice to output 2 trees so I could read them in Word. I tried searching values from one doc in the other and a lot of them I couldn't find.
I kind of assumed fit(x, y) would always return the same tree - if this is the case then I assume my train data of floats is punking me.
from sklearn import tree
clf_dt = tree.DecisionTreeClassifier()
clf_dt.fit(x_train, y_train)
with open("output2.dot", "w") as output_file:
    tree.export_graphviz(clf_dt, out_file=output_file)
There is a random component to the algorithm, which you can read about in the user guide. The relevant part:
The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.
If you want to achieve the same results each time, set the random_state parameter to an integer (by default it's None) and you should get the same result each time.
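For example, with the snippet from the question that would just be:
clf_dt = tree.DecisionTreeClassifier(random_state=0)
clf_dt.fit(x_train, y_train)   # now produces the same tree on every call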

Handling K-means with large dataset 6gb with scikit-learn?

I am using scikit-learn. I want to cluster a 6 GB dataset of documents into groups of similar documents.
I only have about 4 GB of RAM though. Is there a way to get k-means to handle large datasets in scikit-learn?
Thank you. Please let me know if you have any questions.
Use MiniBatchKMeans together with HashingVectorizer; that way, you can learn a cluster model in a single pass over the data, assigning cluster labels as you go or in a second pass. There's an example script that demonstrates MBKM.
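A minimal sketch of that streaming approach; iter_document_batches() is a placeholder for however you read your corpus in chunks, and 20 clusters / 2**18 hashed features are arbitrary choices:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
kmeans = MiniBatchKMeans(n_clusters=20, random_state=0)
# First pass: hash each batch of documents and update the cluster model.
for batch in iter_document_batches():        # e.g. lists of a few thousand docs
    X = vectorizer.transform(batch)          # HashingVectorizer is stateless, no fit needed
    kmeans.partial_fit(X)
# Second pass: assign cluster labels.
for batch in iter_document_batches():
    labels = kmeans.predict(vectorizer.transform(batch))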
Clustering is not in itself that well-defined a problem (a 'good' clustering result depends on your application), and the k-means algorithm only gives locally optimal solutions that depend on the random initialization. Therefore I doubt that the results you would get from clustering a random 2 GB subsample of the dataset would be qualitatively different from the results you would get clustering over the entire 6 GB. I would certainly try clustering on the reduced dataset as a first port of call. Next options are to subsample more intelligently, or to do multiple training runs with different subsets and some kind of selection/averaging across those runs.

k-fold Cross Validation for determining k in k-means?

In a document clustering process, as a data pre-processing step, I first applied singular value decomposition to obtain U, S and Vt, and then, by choosing a suitable number of singular values, I truncated Vt, which now gives me a good document-document correlation from what I read here. I am now performing clustering on the columns of the matrix Vt to cluster similar documents together; for this I chose k-means, and the initial results looked acceptable to me (with k = 10 clusters), but I wanted to dig a bit deeper into choosing the value of k itself. To determine the number of clusters k in k-means, I was advised to look at cross-validation.
Before implementing it I wanted to figure out if there is a built-in way to achieve it using numpy or scipy. Currently, the way I am performing kmeans is to simply use the function from scipy.
import numpy
from scipy.linalg import svd
from scipy.cluster.vq import whiten, kmeans2
# Preprocess the data and compute the SVD
U, S, Vt = svd(A)  # A is the TF-IDF representation of the original term-document matrix
# Obtain the document-document correlations from Vt
# The 50 is the threshold obtained after examining a scree plot of S
docvectors = numpy.transpose(Vt[0:50, :])
# Prepare the data to run k-means
whitened = whiten(docvectors)
centroids, idx = kmeans2(whitened, 10, iter=20)
Assuming my methodology is correct so far (please correct me if I am missing some step), at this stage, what is the standard way of using the output to perform cross-validation? Any reference/implementations/suggestions on how this would be applied to k-means would be greatly appreciated.
To run k-fold cross validation, you'd need some measure of quality to optimize for. This could be either a classification measure such as accuracy or F1, or a specialized one such as the V-measure.
Even the clustering quality measures that I know of need a labeled dataset ("ground truth") to work; the difference with classification is that you only need part of your data to be labeled for the evaluation, while the k-means algorithm can make use of all the data to determine the centroids and thus the clusters.
V-measure and several other scores are implemented in scikit-learn, as well as generic cross validation code and a "grid search" module that optimizes according to a specified measure of evaluation using k-fold CV. Disclaimer: I'm involved in scikit-learn development, though I didn't write any of the code mentioned.
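For instance, once you have ground-truth labels for even a subset of the documents, scoring a clustering with V-measure in scikit-learn is a one-liner; labels_true and labels_pred below are placeholders for the labels of that subset and the cluster assignments of the same documents:
from sklearn.metrics import v_measure_score
# 1.0 means the clustering matches the ground truth perfectly; higher is better.
score = v_measure_score(labels_true, labels_pred)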
Indeed, to do traditional cross-validation with F1-score or V-measure as the scoring function, you would need some labeled data as ground truth. But in that case you could just count the number of classes in the ground-truth dataset and use it as your optimal value for K, hence no need for cross-validation.
Alternatively you could use a cluster stability measure as unsupervised performance evaluation and do some kind of cross validation procedure for that. However this is not yet implemented in scikit-learn even though it's still on my personal todo list.
You can find additional info on this approach in the following answer on metaoptimize.com/qa. In particular you should read Clustering Stability: An Overview by Ulrike von Luxburg.
Here they use withinss to find an optimal number of clusters; "withinss" is an attribute of the kmeans object returned by R's kmeans() and can be used to find a minimum "error":
https://www.statmethods.net/advstats/cluster.html
wss <- (nrow(mydata)-1)*sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")
This formula isn't exactly what you're after, but I'm working on one myself. The model would still change every run, but it would at least be the best model out of a bunch of iterations.
