Normalising Data to use Cosine Distance in Kmeans (Python)

I am currently solving a problem where I have to use cosine distance as the similarity measure for k-means clustering. However, the standard KMeans implementation in the sklearn package uses Euclidean distance and does not allow you to change the metric.
Therefore, it is my understanding that by normalising my original dataset with the code below, I can then run KMeans (using Euclidean distance) and the result will be the same as if I had changed the distance metric to cosine distance. Is that correct?
from sklearn import cluster, preprocessing  # preprocessing.normalize scales each row to unit L2 norm

X_Norm = preprocessing.normalize(X)  # normalise existing X
km2 = cluster.KMeans(n_clusters=5, init='random').fit(X_Norm)
Please let me know if my mathematical understanding of this is incorrect.
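For what it's worth, the algebra behind the question can be checked numerically: for unit-length vectors u and v, ||u − v||² = 2 − 2·cos(u, v), so squared Euclidean distance on the normalised data is a monotone function of cosine distance. A minimal sketch of that check (not from the original post; variable names are illustrative):
import numpy as np
from sklearn import preprocessing

rng = np.random.default_rng(0)
u, v = preprocessing.normalize(rng.normal(size=(2, 10)))  # two random unit vectors
cos_dist = 1.0 - u @ v                                    # cosine distance between unit vectors
sq_euclid = np.sum((u - v) ** 2)                          # squared Euclidean distance
assert np.isclose(sq_euclid, 2.0 * cos_dist)              # ||u - v||^2 == 2 * (1 - cos(u, v))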

Related

K-means with Cosine similarity in Python

I have embedded vectors that I clustered with K-means using cosine similarity.
When I used Spark to do so, I had no problem. Now I wish to convert it into Python.
1. Is it possible to do so with scikit-learn? I couldn't find this option; it seems to support only Euclidean distance. Is that correct?
2. Do you think that scaling all vectors to the unit sphere and then running the sklearn KMeans is identical to running k-means with cosine similarity?
3. If 2 is correct, then for inference on new points, should I just scale them to the unit sphere and use the sklearn KMeans inference?
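For the inference part (question 3), a minimal sketch of the mechanics being described, assuming the model was fitted on L2-normalised vectors (the data and names below are illustrative, not from the original post):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

X_train = np.random.rand(100, 16)
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(normalize(X_train))

X_new = np.random.rand(3, 16)
labels = km.predict(normalize(X_new))  # scale new points to the unit sphere before predicting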

Time-series clustering in python: DBSCAN and OPTICS giving me strange results

I want to perform clustering on time-series data. I use Python's scikit-learn library for the project. First, I created a distance matrix using dynamic time warping (DTW). Then I clustered the data using the OPTICS function in sklearn like this:
clustering = OPTICS(min_samples=3, max_eps=0.7, cluster_method='dbscan', metric="precomputed").fit(distance_matrix)
Then I visualized these distances using MDS like the following:
mds = MDS(n_components=2, dissimilarity="precomputed").fit(distance_matrix)
And this is the result:
The dark blue points are the outliers and the other two groups are the clusters identified by OPTICS. I cannot understand these results; the cluster of yellow points doesn't make any sense. I played with the parameters and changed them, but it always gives strange results. The same happens when I use DBSCAN, but with K-means and AGNES I get more reasonable clusters when I visualize them. Am I doing something wrong here?
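For reference, the pipeline described above can be reproduced end to end on toy data roughly as follows; the naive DTW implementation and the toy series are illustrative assumptions (the question's own distance matrix differs, and parameters such as max_eps would likely need tuning):
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.manifold import MDS

def dtw(a, b):
    # naive O(len(a) * len(b)) dynamic time warping distance
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[-1, -1]

rng = np.random.default_rng(0)
series = [np.sin(np.linspace(0, 6, 50) + s) for s in rng.uniform(0, 3, 20)]
n = len(series)
distance_matrix = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        distance_matrix[i, j] = distance_matrix[j, i] = dtw(series[i], series[j])

clustering = OPTICS(min_samples=3, max_eps=0.7, cluster_method='dbscan',
                    metric='precomputed').fit(distance_matrix)
embedding = MDS(n_components=2, dissimilarity='precomputed').fit_transform(distance_matrix)
print(clustering.labels_)  # -1 marks points treated as noise/outliers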

How to implement sklearn AgglomerativeClustering from clusters?

I'd like to perform a "mixed" unsupervised clustering which first uses a KMeans algorithm to generate a certain number of small, homogeneous clusters, and THEN applies hierarchical clustering to the clusters I get from KMeans.
I used cluster.KMeans from scikit-learn for the first part, and I have my clusters, but I don't know how to use the AgglomerativeClustering function from sklearn so that it starts from those clusters.
Any ideas?
Thank you!
You also get the labels from KMeans.
These give you the partitions.
Just see the manual.
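A minimal sketch of one common way to do this (not from the original answer): run AgglomerativeClustering on the KMeans centroids and map each point to the merged cluster of its centroid. The data set and cluster counts below are illustrative assumptions:
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=8, random_state=0)

km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(X)          # many small, homogeneous clusters
agg = AgglomerativeClustering(n_clusters=5).fit(km.cluster_centers_)  # hierarchy built on the centroids

final_labels = agg.labels_[km.labels_]  # each point inherits the merged label of its KMeans centroid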

Weighted distance in sklearn KNN

I'm writing a genetic algorithm to find weights to apply to the Euclidean distance in the sklearn KNN, trying to improve the classification rate and to remove some features from the dataset (I do this by setting the corresponding weight to 0).
I'm using Python and sklearn's KNN.
This is how I'm using it:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def w_dist(x, y, **kwargs):
    # feature-weighted squared Euclidean distance
    return sum(kwargs["weights"] * ((x - y) * (x - y)))

KNN = KNeighborsClassifier(n_neighbors=1, metric=w_dist, metric_params={"weights": w})
KNN.fit(X_train, Y_train)
neighbors = KNN.kneighbors(n_neighbors=1, return_distance=False)
Y_n = Y_train[neighbors]
tot = 0
for (a, b) in zip(Y_train, Y_n):  # compare each label with its nearest neighbour's label
    if a == b:
        tot += 1
reduc_rate = (X_train.shape[1] - np.count_nonzero(w)) / tamaño  # tamaño = total number of features (defined elsewhere)
class_rate = tot / X_train.shape[0]
It's working really well, but it's very slow. I have been profiling my code and the slowest part is the evaluation of the distance.
I want to ask if there is a different way to tell KNN to use weights in the distance (I must use the Euclidean distance, but without the square root).
Thanks!
There is indeed another way, and it's built into scikit-learn (so it should be quicker). You can use the wminkowski metric with weights. Below is an example with random weights for the features in your training set.
import numpy as np  # for the random example weights

knn = KNeighborsClassifier(metric='wminkowski', p=2,
                           metric_params={'w': np.random.random(X_train.shape[1])})
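As a self-contained usage sketch of the same idea, mirroring the question's leave-self-out evaluation (the synthetic data and the random weight vector are illustrative; this assumes a SciPy/scikit-learn version where the 'wminkowski' name is still available, as newer releases have folded the weight array into the plain 'minkowski' metric instead):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X_train, Y_train = make_classification(n_samples=200, n_features=10, random_state=0)
w = np.random.random(X_train.shape[1])  # stand-in for weights proposed by the genetic algorithm

knn = KNeighborsClassifier(n_neighbors=1, metric='wminkowski', p=2, metric_params={'w': w})
knn.fit(X_train, Y_train)

neighbors = knn.kneighbors(n_neighbors=1, return_distance=False)  # nearest neighbour of each point, excluding itself
class_rate = np.mean(Y_train[neighbors.ravel()] == Y_train)
print(class_rate)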

Affinity Propagation method selection

I am using the sklearn affinity propagation algorithm as below.
affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
I also have a similarity matrix created for the data I am using, and now I want to use this similarity matrix in the affinity propagation model.
In sklearn there are different methods for this, such as fit, fit_predict and predict, so I'm not sure which one to use.
Is it correct if I use
affprop.fit(my_similarity_matrix)
Please suggest which one suits best.
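With affinity="precomputed", the matrix passed to fit is interpreted as similarities (higher means more similar), and fit_predict does the fit and returns the labels in one call. A minimal sketch with illustrative data (cosine similarity here is just an example of a similarity matrix, not the one from the question):
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import cosine_similarity

X = np.random.rand(30, 8)                       # illustrative data
S = cosine_similarity(X)                        # similarity matrix: higher = more similar

affprop = AffinityPropagation(affinity="precomputed", damping=0.5, random_state=0)
labels = affprop.fit_predict(S)                 # same as affprop.fit(S) followed by reading affprop.labels_
print(labels)
print(affprop.cluster_centers_indices_)         # row indices of the exemplars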
