Scikit-learn k-means clustering - Python

I'm supposed to be doing a k-means clustering implementation with some data. The example I looked at from http://glowingpython.blogspot.com/2012/04/k-means-clustering-with-scipy.html shows its test data in 2 columns; however, the data I'm given is 68 subjects with 78 features (a 68x78 matrix). How am I supposed to create an appropriate input for this?
I've basically just tried inputting the matrix anyway, but it doesn't seem to do what I want, and I don't know why it would. I'm pretty confused about what to do.
import numpy as np
from scipy.cluster.vq import kmeans, vq
from matplotlib.pyplot import plot, show

data = np.rot90(data)
centroids, _ = kmeans(data, 2)
# assign each sample to a cluster
idx, _ = vq(data, centroids)
# plot the first two features using numpy's logical indexing
plot(data[idx == 0, 0], data[idx == 0, 1], 'ob',
     data[idx == 1, 0], data[idx == 1, 1], 'or')
plot(centroids[:, 0], centroids[:, 1], 'sg', markersize=8)
show()
I honestly don't know what other code to show you; the data format is as described above. Otherwise, it's the same as the tutorial I linked.

Your visualization only uses the first two dimensions.
That is why these points appear to be "incorrect" - they are closer in a different dimension.
Have a look at the next two dimensions:
plot(data[idx == 0, 2], data[idx == 0, 3], 'ob',
     data[idx == 1, 2], data[idx == 1, 3], 'or')
plot(centroids[:, 2], centroids[:, 3], 'sg', markersize=8)
show()
... and repeat for the rest of your 78 dimensions, e.g. with a loop as sketched below.
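For instance, a compact way to do that repetition, reusing data, idx, and centroids from the snippets above:

# loop over consecutive pairs of the 78 features
for d in range(0, 78, 2):
    plot(data[idx == 0, d], data[idx == 0, d + 1], 'ob',
         data[idx == 1, d], data[idx == 1, d + 1], 'or')
    plot(centroids[:, d], centroids[:, d + 1], 'sg', markersize=8)
    show()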
With this many features, (squared) Euclidean distance becomes close to meaningless, and k-means results tend to be no better than random convex partitions.
To get a more representative view, consider using MDS (multidimensional scaling) to project the data into 2D for visualization. It should be reasonably fast with just 68 subjects.
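A minimal sketch of that projection with sklearn's MDS, reusing data and idx from the snippet above:

from sklearn.manifold import MDS
from matplotlib.pyplot import scatter, show

# project the 68x78 matrix to 2D and color each subject by its cluster
embedding = MDS(n_components=2).fit_transform(data)
scatter(embedding[:, 0], embedding[:, 1], c=idx)
show()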
Please include visualizations in your questions. We don't have your data.

Related

What is the best algorithm to cluster this data

Can someone help me find a good clustering algorithm that will cluster this into 3 clusters, without my defining the number of clusters?
I have tried many algorithms in their basic form; nothing seems to work properly.
clustering = AgglomerativeClustering().fit(temp)
I tried DBSCAN and k-means the same way, just following the guidelines from sklearn, and I couldn't get the expected results.
My original data set is a 1D list of numbers, but the order of the numbers matters, so I generated a 2D list as below.
from sklearn.cluster import AgglomerativeClustering

# pair each value with its position so the ordering is part of the features
temp = []
for i in range(len(avgs)):
    temp.append([avgs[i], i + 1])
clustering = AgglomerativeClustering().fit(temp)
In plotting I used a similar range as the y axis:
ax2.scatter(range(len(plots[i])), plots[i], c=np.random.rand(3,))
The order of the data matters, so this needs to be clustered into 3. There might also be other data sets where the data is very uniform, so the result for those needs to be just one cluster.
Link to the list if someone wants to try.
So I tried using step detection and, following your answer, got the following image. But how can I find the values of the peaks? If I take the max value I can get one of them, but how do I get the rest? The second-largest value is not an answer, because the point right next to the max is the second-largest.
Your data is not 2D coordinates, so don't choose an algorithm designed for that!
Instead, your data appears to be sequential, or a time series.
What you want is a change point detection algorithm, capable of detecting a change in the mean value of a series.
A simple approach would be to compute the sum of the next 10 points minus the sum of the previous 10 points, then look for extreme values of this curve.
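A minimal sketch of that detector, assuming avgs is the 1D list from the question and the window of 10 described above; scipy's find_peaks is an added suggestion that addresses the follow-up about adjacent maxima:

import numpy as np
from scipy.signal import find_peaks

x = np.asarray(avgs, dtype=float)
w = 10
# score[j] = sum of the next w points minus sum of the previous w points
score = np.array([x[i:i + w].sum() - x[i - w:i].sum()
                  for i in range(w, len(x) - w)])
# extreme values of the curve, kept at least one window apart so the
# neighbor of a maximum is not reported as a second peak
peaks, _ = find_peaks(np.abs(score), distance=w)
change_points = peaks + w  # shift back to indices in the original list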

How to plot classification regions in a lower dimensional space?

I'm working in a space which has 8 dimensions (i.e. 8 features). I have plotted the data points in 2D by applying PCA as well as t-SNE. Now I would also like to draw the borderlines of the classifiers I use, as shown here. By the way, I'm using different classifiers (SVM, GNB, Logistic Regression).
This means that I have the different 8-dimensional points, which I plot in 2D using PCA or t-SNE. On top of this plot I would like to draw the different classification regions as shown in the link above.
Of course the classification boundaries/regions are also 8-dimensional. How can I turn them into 2D, matching my 2D data points?
Interesting question; I once wondered about it myself.
It can be answered in several ways, with more or less detail depending on whether you want to fully understand the method or just apply it.
Since you didn't give a lot of detail but included a sklearn link, I will first answer from a technical point of view: "How can you do it with sklearn?"
You have a function for this: transform(X, y=None), which applies the PCA projection (yes, PCA is a projection from a high-dimensional space to a lower-dimensional one).
So you basically just need to call transform(your_boundaries) to apply it.
In terms of pseudocode this would give:
from sklearn.decomposition import PCA

pca = PCA(n_components=2).fit(data)
boundaries_2d = pca.transform(boundaries)
Et voilà!
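In practice, the regions are often easier to get by going the other way: classify a grid of 2D points mapped back into the original space. A hedged sketch, assuming an 8-column feature matrix X with integer labels y and an illustrative SVM; this works for PCA, which has inverse_transform, but not for t-SNE, which has no inverse mapping:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.svm import SVC

pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)
clf = SVC().fit(X, y)  # the classifier is trained in the full 8D space

# build a 2D grid covering the projected points
xx, yy = np.meshgrid(np.linspace(X_2d[:, 0].min(), X_2d[:, 0].max(), 200),
                     np.linspace(X_2d[:, 1].min(), X_2d[:, 1].max(), 200))
grid_2d = np.c_[xx.ravel(), yy.ravel()]
# map each 2D grid point back to 8D and classify it there
zz = clf.predict(pca.inverse_transform(grid_2d)).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3)   # the classification regions
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, edgecolor='k')
plt.show()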
Do not hesitate to give more details or ask questions; I could add some specific development if it is relevant.
Hope it helps
pltrdy

K means clustering on unevenly sized clusters

I have to use k-means clustering (I am using scikit-learn) on a dataset that looks like this.
But when I apply k-means it doesn't give me the centroids I expected, and it classifies incorrectly.
Also, how could I find the points that are not correctly classified in scikit-learn?
Here is the code.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

km = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10)
km.fit(Train_data.values)
plt.plot(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], 'ro')
plt.show()
Here Train_data is a pandas DataFrame with 2 features and 3500 samples, and the code gives the following.
It might have happened because of a bad choice of initial centroids, but what could be the solution?
First of all, I hope you noticed that the range on the X and Y axes is different in the two figures, so the first centroid (sorted by X-value) isn't that bad. The second and third ones come out this way because of the large number of outliers: they are probably each absorbing half of the two rightmost clusters.
Also, the output of k-means depends on the initial choice of centroids, so see if different runs or setting the init parameter to 'random' improves results. Another way to improve things would be to remove all points having fewer than some n neighbors within a radius d; to implement that efficiently you would probably need a k-d tree, or you can just use DBSCAN provided by sklearn here and see if it works better.
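A hedged sketch of that filtering idea via DBSCAN's noise label; eps and min_samples are placeholder values to tune:

from sklearn.cluster import DBSCAN

X = Train_data.values
# DBSCAN marks points with fewer than min_samples neighbors within eps as noise (-1)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
X_clean = X[labels != -1]  # drop the noise before re-running k-means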
Also, k-means++ is likely to pick outliers as initial centroids, as explained here. So you may want to change the init parameter in KMeans to 'random', perform multiple runs, and take the best centroids.
For your data, since it is 2D, it is easy to tell whether points are classified correctly or not. Use the mouse to 'pick' the coordinates of an approximate centroid (see here) and compare the clusters obtained from the picked coordinates to those obtained from k-means.
I found a solution for this. The problem was scaling.
I just scaled both axes using sklearn.preprocessing.scale.
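A minimal sketch of that fix, reusing Train_data from the question:

from sklearn.preprocessing import scale
from sklearn.cluster import KMeans

# standardize each feature to zero mean and unit variance before clustering
X_scaled = scale(Train_data.values)
km = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10)
km.fit(X_scaled)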
And this is my result

How can I fine tune K means clustering when I'm only getting clusters in lines?

It's my first time trying to do k-means clustering using Python and scikit-learn, and I don't know what to make of my final cluster plot or how to fine-tune my k-means clustering algorithm.
My end goal is to find a clustering of user categories that delineates some interesting or useful behavior traits.
ATTEMPT 1:
Input: Gender, Age Range, Country (all one hot encoded because the data is categorical), and Account Age (numerical in weeks old)
Code:
import sklearn.cluster
import pandas as pd
from matplotlib import pyplot

# Convert DataFrame to matrix (.values replaces the deprecated as_matrix())
mat2 = all_dummy.values
# Using sklearn
km2 = sklearn.cluster.KMeans(n_clusters=6)
km2.fit(mat2)
# Get cluster assignment labels
labels2 = km2.labels_
# Format results as a DataFrame
results2 = pd.DataFrame([all_dummy.index, labels2]).T

plot_x2 = results2[0].tolist()
plot_y2 = results2[1].tolist()
pyplot.scatter(plot_x2, plot_y2)
pyplot.show()
Plot:
Specific Questions:
What is the X and Y axis of this graph?
What is this graph even telling me?
Why are there only 3 clusters showing up when I put 6 clusters as an input? (answered by first comment and updated code and graph)
How can I fine tune this graph to tell me more and show me a useful relationship if I don't know what the relationship I am looking for is?
Read up on the limitations of k-means.
In particular, be aware that
you must remove all identifier columns
k-means is very sensitive to scale. All attributes need to be carefully scaled according to their value range, distribution, and importance; preprocessing is essential (see the sketch after this list)!
k-means assumes continuous variables. Its use on categorical data, even when one-hot encoded, is questionable; it sometimes works "okayish" but rarely works well.
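A minimal sketch of that scaling step, reusing the all_dummy DataFrame from the question (the caveat about one-hot columns still applies):

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# standardize so Account Age (in weeks) doesn't dominate the 0/1 dummies
X = StandardScaler().fit_transform(all_dummy.values)
km2 = KMeans(n_clusters=6).fit(X)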
According to your code, the X axis corresponds to the indices of your samples (seeing your graph, I suppose you have around 10,000 users), and the Y axis corresponds to the labels of each sample.
You might not have passed 6 clusters as input. Indeed, when you format your results as a DataFrame, a labels variable is used, while it is actually labels2 that contains the computed cluster assignments. I don't know where your labels come from, but I suspect this is the reason you obtained those results. Hence, regarding question 2, this graph probably doesn't show anything relevant.
You could first use other visualisations to better understand how your data is being clustered. Sklearn's documentation provides many examples you could use for inspiration (1, 2, 3).
Hope it helped!

DBSCAN: plotting non-geometrical data

I used the sklearn clustering algorithm DBSCAN to get clusters of my data.
Data: non-geometrical objects based on hexadecimal strings
I used a simple distance to create a distance matrix as input for DBSCAN, which resulted in the expected clusters.
Question: Is it possible to create a plot of these cluster results like in the demo?
I didn't find a solution through searching.
I need to graphically demonstrate the similarities of the objects and clusters to each other.
Since I am using Python for everything (in that project), I would appreciate a solution in Python.
I don't use Python, so I cannot give you example code.
If your data isn't 2-dimensional, you can try to find a good 2-dimensional approximation using multidimensional scaling (MDS).
Essentially, it takes an input distance matrix (which should satisfy the triangle inequality, and ideally be derived from a Euclidean distance in some vector space; but you can often get good results even if this does not strictly hold) and then tries to find the 2-dimensional data set that best preserves these distances.
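Since the answer has no code, here is a hedged Python sketch with sklearn's MDS, assuming dist is the precomputed distance matrix fed to DBSCAN and labels holds its cluster labels (both names are placeholders):

import matplotlib.pyplot as plt
from sklearn.manifold import MDS

# embed the objects in 2D so that pairwise distances are approximately preserved
mds = MDS(n_components=2, dissimilarity='precomputed')
coords = mds.fit_transform(dist)
plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.show()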
