Weighted clustering in sklearn - python

Assume I have a set of points (x, y, and size). I want to find clusters in my data using sklearn.cluster.DBSCAN and their centers. That is no problem if I treat every point the same. But actually I want weighted centers instead of geometric centers (meaning a bigger point should count more than a smaller one).
I came across sample_weight, but I don't quite get whether that is what I need. When I use sample_weight (right side) I get completely different clusters from the case where I don't use it (left side).
Second, I thought about using np.repeat(x, w), where x is my data and w is the size of each point, so that I get multiple copies of the points proportional to their weights. But this is probably not a smart solution, as I will end up with a lot of data, right?
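For concreteness, a minimal sketch of that replication idea, assuming X is an (n, 2) coordinate array and w holds integer weights per point:

import numpy as np
X_expanded = np.repeat(X, w, axis=0)   # each point appears w[i] times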
Is sample_weight useful in my case, or are there suggestions for better solutions than using np.repeat? I know there are already some questions about sample_weight, but I could not work out how to use it exactly.
Thanks!

The most important thing for DBSCAN is the parameter setting. There are two parameters, epsilon and minPts (= min_samples). Epsilon is the radius around each point, and minPts is the minimum number of points that must fall within that radius for a point to be considered part of a cluster. So instead of using np.repeat I would suggest adjusting these parameters for this dataset.
According to the documentation of DBSCAN, sample_weight is a tuning parameter for your runtime:
Another way to reduce memory and computation time is to remove
(near-)duplicate points and use sample_weight instead.
I think you want to address the quality of your result before you tune your runtime.
I am not sure what you mean by weighted centers; perhaps you are referring to a different clustering algorithm, such as a Gaussian mixture model.
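That said, if the goal really is weighted cluster centers, a minimal sketch (assuming the point sizes are available as a weights array w, and with eps/min_samples as placeholders that need tuning) is to pass the weights to DBSCAN.fit via sample_weight and then take a weighted mean per cluster with np.average:

import numpy as np
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5).fit(X, sample_weight=w)
labels = db.labels_

weighted_centers = {}
for label in set(labels) - {-1}:        # -1 marks noise points in DBSCAN
    mask = labels == label
    weighted_centers[label] = np.average(X[mask], axis=0, weights=w[mask])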

Related

how to build clusters that are approximately balanced in size in sklearn

As seen above: how can I build clusters that are approximately balanced in size in sklearn? I have a question: clustering is done according to certain rules, so why can we specify the number of points in each cluster? Anyway, I want to know how to achieve this step.
I also have another idea about it: count the number of points for each label, then calculate the variance of those counts, and pick the result with the smallest variance.
Some methods (for example, the non-sklearn HDBSCAN: https://hdbscan.readthedocs.io/en/latest/parameter_selection.html) have parameters like min_cluster_size. sklearn's DBSCAN min_samples will probably work in a similar way. It will not give you exactly 'balanced' clusters, but it may help.
But in my opinion, it is sometimes more reasonable to run the clustering algorithm with different parameters and select the 'more balanced' output by hand. In that case you can see which points are not separable, and perhaps add more data (compute an additional distance matrix, for example) or change the target metric.
Why can we specify the number of points in each cluster?
Because the tasks 'find clusters' and 'balance them' are, in most cases, somewhat opposite in meaning. And that is before even considering algorithms where you need to specify the number of clusters.
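A minimal sketch of that manual selection, combining the variance-of-cluster-sizes idea from the question with repeated runs (here KMeans with different seeds; the data array X and k=3 are placeholders):

import numpy as np
from sklearn.cluster import KMeans

best_labels, best_var = None, np.inf
for seed in range(10):                                  # several random initializations
    labels = KMeans(n_clusters=3, random_state=seed, n_init=10).fit_predict(X)
    sizes = np.bincount(labels)                         # points per cluster
    if sizes.var() < best_var:                          # keep the most balanced run
        best_labels, best_var = labels, sizes.var()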

K means clustering on unevenly sized clusters

I have to use k-means clustering (I am using scikit-learn) on a dataset that looks like this:
But when I apply k-means, it doesn't give me the centroids as expected and classifies points incorrectly.
Also, how could I find the points that are not correctly classified in scikit-learn?
Here is the code.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

km = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10)
km.fit(Train_data.values)
# plot the fitted cluster centers
plt.plot(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], 'ro')
plt.show()
Here Train_data is a pandas DataFrame with 2 features and 3500 samples, and the code gives the following:
It might have happened because of a bad choice of initial centroids, but what could be the solution?
First of all, I hope you noticed that the range on the X and Y axes is different in the two figures. So the first centroid (sorted by X value) isn't that bad. The second and third ones come out this way because of the large number of outliers: they are probably each taking half of both rightmost clusters. Also, the output of k-means depends on the initial choice of centroids, so see whether different runs, or setting the init parameter to 'random', improves the results. Another way to improve things would be to remove all points that have fewer than some n neighbors within a radius d. To implement that efficiently you would probably need a k-d tree, or you could just use DBSCAN as provided by sklearn and see if it works better.
Also, k-means++ is likely to pick outliers as initial centers, as explained here. So you may want to change the init parameter in KMeans to 'random', perform multiple runs, and take the best centroids.
Since your data is 2-D, it is easy to check whether points are classified correctly or not. Use the mouse to pick the coordinates of an approximate centroid (see here) and compare the clusters obtained from the picked coordinates to those obtained from k-means.
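A minimal sketch of the neighbor-count filtering suggested above (drop points with fewer than n neighbors within radius d before running k-means); X, d, and n are placeholders to be tuned for the data:

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans

d, n = 0.5, 10                                         # radius and minimum neighbor count
nn = NearestNeighbors(radius=d).fit(X)
neighbors = nn.radius_neighbors(X, return_distance=False)
counts = np.array([len(idx) for idx in neighbors])     # each point counts itself as well
X_dense = X[counts >= n]                               # discard sparse (outlier) points
km = KMeans(n_clusters=3, init='random', n_init=20).fit(X_dense)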
I got a solution for this.
The problem was scaling.
I just scaled both axes using sklearn.preprocessing.scale, and this is my result:
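For reference, a rough sketch of that fix, assuming Train_data as above:

from sklearn.preprocessing import scale
from sklearn.cluster import KMeans

X_scaled = scale(Train_data.values)                # zero mean, unit variance per feature
km = KMeans(n_clusters=3, n_init=10).fit(X_scaled)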

scikit KernelPCA unstable results

I'm trying to use KernelPCA for reducing the dimensionality of a dataset to 2D (both for visualization purposes and for further data analysis).
I experimented with computing KernelPCA using an RBF kernel at various values of gamma, but the result is unstable:
(each frame is a slightly different value of gamma, where gamma varies continuously from 0 to 1)
Looks like it is not deterministic.
Is there a way to stabilize it/make it deterministic?
Code used to generate transformed data:
from sklearn.decomposition import KernelPCA

def pca(X, gamma1):
    kpca = KernelPCA(kernel="rbf", fit_inverse_transform=True, gamma=gamma1)
    X_kpca = kpca.fit_transform(X)
    # X_back = kpca.inverse_transform(X_kpca)
    return X_kpca
KernelPCA should be deterministic and evolve continuously with gamma. It is different from RBFSampler, which does have built-in randomness in order to provide an efficient (more scalable) approximation of the RBF kernel.
However, what can change in KernelPCA is the order of the principal components: in scikit-learn they are returned sorted in order of descending eigenvalue, so if you have two eigenvalues that are close to each other, the order may change with gamma.
My guess (from the gif) is that this is what is happening here: the axes along which you are plotting are not constant so your data seems to jump around.
Could you provide the code you used to produce the gif?
I'm guessing it is a plot of the data points along the 2 first principal components but it would help to see how you produced it.
You could try to further inspect it by looking at the values of kpca.alphas_ (the eigenvectors) for each value of gamma.
Hope this makes sense.
EDIT: As you noted, it looks like the points are reflected across an axis; the most plausible explanation is that one of the eigenvectors flips sign (note that this does not affect the eigenvalue).
I put in a simple gist to reproduce the issue (you'll need a Jupyter notebook to run it). You can see the sign-flipping when you change the value of gamma.
As a complement, note that this kind of discrepancy happens only because you fit the KernelPCA object several times. Once you have settled on a particular gamma value and have fit kpca once, you can call transform several times and get consistent results.
For the classical PCA the docs mention that:
Due to implementation subtleties of the Singular Value Decomposition (SVD), which is used in this implementation, running fit twice on the same matrix can lead to principal components with signs flipped (change in direction). For this reason, it is important to always use the same estimator object to transform data in a consistent fashion.
I don't know about the behavior of a single KernelPCA object that you would fit several times (I did not find anything relevant in the docs).
It does not apply to your case though as you have to fit the object with several gamma values.
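If the sign flips are the only issue, one workaround (a sketch of a sign convention you would have to impose yourself; KernelPCA does not do this for you) is to flip each component so that its largest-magnitude coordinate is positive, which keeps the plot orientation stable across gamma values:

import numpy as np
from sklearn.decomposition import KernelPCA

def pca_fixed_sign(X, gamma1):
    X_kpca = KernelPCA(kernel="rbf", gamma=gamma1).fit_transform(X)
    for j in range(X_kpca.shape[1]):          # enforce a sign convention per component
        i = np.argmax(np.abs(X_kpca[:, j]))
        if X_kpca[i, j] < 0:
            X_kpca[:, j] *= -1
    return X_kpca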
So... I can't give you a definitive answer on why KernelPCA is not deterministic. The behavior resembles the differences I've observed between the results of PCA and RandomizedPCA. PCA is deterministic, but RandomizedPCA is not, and sometimes the eigenvectors are flipped in sign relative to the PCA eigenvectors.
That leads me to a vague idea of how you might get more deterministic results... maybe. Use RBFSampler with a fixed seed:
from sklearn.kernel_approximation import RBFSampler
from sklearn.decomposition import PCA

def pca(X, gamma1):
    # approximate the RBF kernel feature map with a fixed random seed
    kernvals = RBFSampler(gamma=gamma1, random_state=0).fit_transform(X)
    # ordinary (deterministic) PCA on the approximated feature map
    X_kpca = PCA().fit_transform(kernvals)
    return X_kpca

Efficient k-means evaluation with silhouette score in sklearn

I am running k-means clustering on ~1 million items (each represented as a ~100-feature vector). I have run the clustering for various k, and now want to evaluate the different results with the silhouette score implemented in sklearn. Attempting to run it with no sampling seems unfeasible and takes a prohibitively long time, so I assume I need to use sampling, i.e.:
metrics.silhouette_score(feature_matrix, cluster_labels, metric='euclidean',sample_size=???)
I don't have a good sense of what an appropriate sampling approach is, however. Is there a rule of thumb for what size sample to use given the size of my matrix? Is it better to take the largest sample my analysis machine can handle, or to take the average of more smaller samples?
I ask in large part because my preliminary test (with sample_size=10000) has produced some really really unintuitive results.
I'm also open to alternative, more scalable evaluation metrics.
Editing to visualize the issue: the plot shows, for varying sample sizes, the silhouette score as a function of the number of clusters.
What's not weird is that increasing the sample size seems to reduce noise. What is weird, given that I have 1 million very heterogeneous vectors, is that 2 or 3 comes out as the "best" number of clusters. In other words, what's unintuitive is that I find a more-or-less monotonic decrease in silhouette score as I increase the number of clusters.
Other metrics
Elbow method: compute the % variance explained for each K, and choose the K where the plot starts to level off (a good description is here: https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set). Obviously, if you have K equal to the number of data points, you can explain 100% of the variance; the question is where the improvements in explained variance start to level off. A sketch of this is shown after these points.
Information theory: If you can calculate a likelihood for a given K, then you can use the AIC, AICc, or BIC (or any other information-theoretic approach). E.g. for the AICc, it just balances the increase in likelihood as you increase K with the increase in the number of parameters you need. In practice all you do is choose the K that minimises the AICc.
You may be able to get a feel for a roughly appropriate K by running alternative methods that give you back an estimate of the number of clusters, like DBSCAN. Though I haven't seen this approach used to estimate K, and it is probably inadvisable to rely on it like this. However, if DBSCAN also gives you a small number of clusters here, then there is likely something about your data that you are not appreciating (i.e. there are not as many clusters as you're expecting).
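A minimal sketch of the elbow method mentioned above, using KMeans' inertia_ (within-cluster sum of squares) as the quantity that should level off; the candidate range of k is an arbitrary placeholder:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(2, 15)                  # candidate numbers of clusters
inertias = [KMeans(n_clusters=k, n_init=10).fit(feature_matrix).inertia_ for k in ks]
plt.plot(ks, inertias, 'o-')       # look for the "elbow" where this curve levels off
plt.xlabel('k')
plt.ylabel('within-cluster sum of squares')
plt.show()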
How much to sample
It looks like you've answered this from your plot: no matter what your sample size is, you get the same pattern in the silhouette score, so that pattern seems very robust to sampling assumptions.
k-means converges to local minima, so the starting positions play a crucial role in finding a good number of clusters. It is often a good idea to reduce noise and dimensionality using PCA or another dimensionality-reduction technique before proceeding with k-means.
Just to add for the sake of completeness: it might be a good idea to get the optimal number of clusters by "partitioning around medoids". It is equivalent to using the silhouette method.
The reason for the weird observations could be the different starting points for the different-sized samples.
Having said all the above, it is important to evaluate the clusterability of the dataset at hand. A tractable way to do so is the worst-pair ratio, as discussed here: Clusterability.
Since there is no widely-accepted best approach to determine the optimal number of clusters, all evaluation techniques, including Silhouette Score, Gap Statistic, etc. fundamentally rely on some form of heuristic/trial&error argument. So to me, the best approach is to try out multiple techniques and to NOT develop over-confidence in any.
In your case, the ideal and most accurate score would be calculated on the entire data set. However, if you need to use partial samples to speed up the computation, you should use the largest sample size your machine can handle. The rationale is the same as using as many data points as possible from the population of interest.
One more thing is that the sklearn implementation of the silhouette score uses random (non-stratified) sampling. You can repeat the calculation multiple times using the same sample size (say sample_size=50000) to get a sense of whether that sample size is large enough to produce consistent results.
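A minimal sketch of that repeated-sampling check, reusing feature_matrix and cluster_labels from the question and varying only the random seed:

import numpy as np
from sklearn import metrics

scores = [metrics.silhouette_score(feature_matrix, cluster_labels,
                                   metric='euclidean', sample_size=50000,
                                   random_state=seed)
          for seed in range(5)]
print(np.mean(scores), np.std(scores))   # a small std suggests the sample size is adequate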

Is it possible to use complex numbers as target labels in scikit learn?

I am trying to use sklearn to predict a variable that represents rotation. Because of the unfortunate jump from -pi to pi at the extremes of rotation, I think a much better method would be to use a complex number as the target. That way an error from 1+0.01j to 1-0.01j is not as devastating.
I cannot find any documentation that describes whether sklearn supports complex numbers as targets to classifiers. In theory the distance metric should work just fine, so it should work for at least some regression algorithms.
Can anyone suggest how I can get a regression algorithm to operate with complex numbers as targets?
Several regressors support multidimensional regression targets. Just view the complex numbers as 2d points.
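A minimal sketch of that 2-D encoding, assuming the rotation targets are angles in radians; the regressor choice (kNN, which handles multi-output targets) and the variable names X, angles, and X_new are illustrative:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

Y = np.column_stack([np.cos(angles), np.sin(angles)])    # complex target viewed as a 2-D point
reg = KNeighborsRegressor(n_neighbors=5).fit(X, Y)

Y_pred = reg.predict(X_new)                              # predictions for new samples
angles_pred = np.arctan2(Y_pred[:, 1], Y_pred[:, 0])     # recover the angle in (-pi, pi]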
So far I have discovered that most estimators, such as linear regressors, will automatically convert complex numbers to just their real part.
kNN and RadiusNN regressors, however, work well - since they do a weighted average of the neighbor labels and so handle complex numbers gracefully.
Using a multi-target regressor is another option; however, I do not want to decouple the x and y directions, since that may lead to unstable solutions (as Colonel Panic mentions) when both results come out close to 0.
I will try other classifiers with complex targets and update the results here.
Good question. How about transforming the angle into a pair of labels, namely the x and y coordinates? These are continuous functions of the angle (cos and sin). You can then combine the results from separate x and y regressors to recover the angle, $\theta = \operatorname{atan2}(y, x)$. However, that result will be unstable if both regressors return numbers near zero.
