what is the best algorithm to cluster this data

what is the best algorithm to cluster this data - python

can some one help me find a good clustering algorithm that will cluster this into 3 clusters without defining the number of clusters.
i have tried many algorithms in its basic form.. nothing seems to work properly.
clustering = AgglomerativeClustering().fit(temp)
same way i tried the dbscan and kmeans too.. just used the guidelines from sklean. i couldn't get the expected results.
my original data set is a 1D list of numbers.. but the order of the numbers matters, so generated a 2D list as bellow.
temp = []
for i in range(len(avgs)):
temp.append([avgs[i], i+1])
clustering = AgglomerativeClustering().fit(temp)
in plotting piloting i used a similter range as the y axis
ax2.scatter(range(len(plots[i])), plots[i], c=np.random.rand(3,))
the order of the data matters, so this need to clustered into 3. and there might be some other data sets that the data is very good so that the result of that need to be just one cluster.
Link to the list if someone want to try
so i tried using the step detection and got the following image according to ur answer. but how can i find the values of the peaks.. if i get the max value i can get one of them.. but how to get the rest of it.. the second max is not an answer because the one right next to the max is the second max

Your data is not 2d coordinates. So don't choose an algorithm designed for that!
Instead your data appears to be sequential or time series.
What you want to use is a change point detection algorithm, capable of detecting a change in the mean value of a series.
A simple approach would be to compute the sum of the next 10 points minus the sum of the previous 10 points, then look for extreme values of this curve.

Related

HDBSCAN Shouldn't any object in a cluster have a probability value > 0? And producing inconsistent results

I am using hdbscan to find clusters within a dataset in a Python Jupyter notebook.
import pandas as pandas
import numpy as np
data = pandas.read_csv('data.csv')
That data looks something like this:
import hdbscan
clusterSize = 6
clusterer = hdbscan.HDBSCAN(min_cluster_size=clusterSize).fit(data)
And yay! everything seems to work!
So I then want to see some results, so I add these results to my data frame:
data.insert(18,"labels",clusterer.labels_)
data.insert(19,"probabilities",clusterer.probabilities_)
But wait, I have rows with labels for clusters that have probabilities at 0. How does that make sense? Shouldn't any object in a cluster have a probability value > 0? Oh, and all the probabilities are only 0 OR 1.
So I rerun this in Jupyter notebook, specifically, I just rerun
clusterer = hdbscan.HDBSCAN(min_cluster_size=clusterSize).fit(data)
and I check the values for clusterer.labels_ and clusterer.probabilities_ and they are different. Isn't this thing supposed to be consistent? Why would those values change? Is there some hidden state that I'm not told about? But now my clusterer.probabilities_ have values that are between 0 and 1... so that's good right?
So I'm not very familiar with this hdbscan tool obviously, but can someone explain why it gives out different answers when ran multiple times and if probability 0 on a labeled/clustered object makes sense?

According to API:
labels: Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1.
probabilities: The strength with which each sample is a member of its assigned cluster. Noise points have probability zero; points in clusters have values assigned proportional to the degree that they persist as part of the cluster.
Therefore probability of zero is meaningful.
I was also expecting that the results of different runs on the same data be the same, but it looks like it is not exactly true. According to wiki:
DBSCAN is not entirely deterministic: border points that are reachable from more than one cluster can be part of either cluster, depending on the order the data are processed. For most data sets and domains, this situation does not arise often and has little impact on the clustering result:[4] both on core points and noise points, DBSCAN is deterministic. DBSCAN* is a variation that treats border points as noise, and this way achieves a fully deterministic result as well as a more consistent statistical interpretation of density-connected components.
So maybe the selection of a specific algorithm will help to fix the clustering be deterministic.

How do I find the 100 most different points within a pool of 10,000 points?

I have a set of 10,000 points, each made up of 70 boolean dimensions. From this set of 10,000, I would like to select 100 points which are representative of the whole set of 10,000. In other words, I would like to pick the 100 points which are most different from one another.
Is there some established way of doing this? The first thing that comes to my mind is a greedy algorithm, which begins by selecting one point at random, then the next point is selected as the most distant one from the first point, and then the second point is selected as having the longest average distance from the first two, etc. This solution doesn't need to be perfect, just roughly correct. Preferably, this solution of 100 points can also be found within ~10 minutes but finishing within 24 hours is also fine.
I don't care about distance, in particular, that's just something that comes to mind as a way to capture "differentness."
If it matters, every point has 10 values of TRUE and 60 values of FALSE.
Some already-built Python package to do this would be ideal, but I am also happy to just write the code myself something if somebody could point me to a Wikipedia article.
Thanks

Your use of "representative" is not standard terminology, but I read your question as you wish to find 100 items that cover a wide gamut of different examples from your dataset. So if 5000 of your 10000 items were near identical, you would prefer to see only one or two items from that large sub-group. Under the usual definition, a representative sample of 100 would have ~50 items from that group.
One approach that might match your stated goal is to identify diverse subsets or groups within your data, and then pick an example from each group.
You can establish group identities for a fixed number of groups - with different membership size allowed for each group - within a dataset using a clustering algorithm. A good option for you might be k-means clustering with k=100. This will find 100 groups within your data and assign all 10,000 items to one of those 100 groups, based on a simple distance metric. You can then either take the central point from each group or a random sample from each group to find your set of 100.
The k-means algorithm is based around minimising a cost function which is the average distance of each group member from the centre of its group. Both the group centres and the membership are allowed to change, updated in an alternating fashion, until the cost cannot be reduced any further.
Typically you start by assigning each item randomly to a group. Then calculate the centre of each group. Then re-assign items to groups based on closest centre. Then recalculate the centres etc. Eventually this should converge. Multiple runs might be required to find an good optimum set of centres (it can get stuck in a local optimum).
There are several implementations of this algorithm in Python. You could start with the scikit learn library implementation.
According to an IBM support page (from comment by sascha), k-means may not work well with binary data. Other clustering algorithms may work better. You could also try to convert your records to a space where Euclidean distance is more useful and continue to use k-means clustering. An algorithm that may do that for you is principle component analysis (PCA) which is also implemented in scikit learn.

The graph partitioning tool METIS claims to be able to partition graphs with millions of vertices in 256 parts within seconds.
You could treat your 10.000 points as vertices of an undirected graph. A fully connected graph with 50 million edges would probably be too big. Therefore, you could restrict the edges to "similarity links" between points which have a Hamming distance below a certrain threshold.
In general, Hamming distances for 70-bit words have values between 0 and 70. In your case, the upper limit is 20 as there are 10 true coordinates and 60 false coordinates per point. The maximum distance occurs, if all true coordinates are differently located for both points.
Creation of the graph is a costly operation of O(n^2). But it might be possible to get it done within your envisaged time frame.

K means clustering on unevenly sized clusters

I have to use k means clustering (I am using Scikit learn) on a dataset looks like this
But when I apply the K means it doesn't give me the centroids as expected. and classifies incorrectly.
Also What would be the ideas if I want to know the points not correctly classify in scikit learn.
Here is the code.
km = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10)
km.fit(Train_data.values)
plt.plot(km.cluster_centers_[:,0],km.cluster_centers_[:,1],'ro')
plt.show()
Here Train_data is pandas frame and having 2 features and 3500 samples and the code gives following.
I might have happened because of bad choice of initial centroids but what could be the solution ?

First of all I hope you noticed that range on X and Y axis is different in both figures. So, the first centroid(sorted by X-value) isn't that bad. The second and third ones are so obtained because of large number of outliers. They are probably taking half of both the rightmost clusters each. Also, the output of k-means is dependent on initial choice of centroids so see if different runs or setting init parameter to random improves results. Another way to improve efficiency would be to remove all the points having less than some n neighbors within a radius of distance d. To implement that efficiently you would need a kd-tree probably or just use DBSCAN provided by sklearn here and see if it works better.
Also K-Means++ is likely to pick outliers as initial cluster as explained here. So you may want to change init parameter in KMeans to 'random' and perform multiple runs and take the best centroids.
For your data since it is 2-D it is easy to know if points are classified correctly or not. Use mouse to 'pick' up coordinates of approximate centroid (see here) and just compare the cluster obtained from picked coordinates to those obtained from k-means.

I got a solution for this.
The problem was scaling.
I just scaled both axes using
sklearn.preprocessing.scale
And this is my result

How can I fine tune K means clustering when I'm only getting clusters in lines?

It's my first time trying to do K-Means clustering using Python and Sci-Kit Learn and I don't know what to make of my final cluster plot or how to fine tune my K means clustering algorithm.
My end goal is to find a clustering of user categories that delineates some interesting or useful behavior traits.
ATTEMPT 1:
Input: Gender, Age Range, Country (all one hot encoded because the data is categorical), and Account Age (numerical in weeks old)
Code:
# Convert DataFrame to matrix
mat2 = all_dummy.as_matrix()
# Using sklearn
km2 = sklearn.cluster.KMeans(n_clusters=6)
km2.fit(mat2)
# Get cluster assignment labels
labels2 = km2.labels_
# Format results as a DataFrame
results2 = pd.DataFrame([all_dummy.index,labels2]).T
plot_x2 = results2[0].tolist()
plot_y2 = results2[1].tolist()
pyplot.scatter(plot_x2,plot_y2)
pyplot.show()
Plot:
Specific Questions:
What is the X and Y axis of this graph?
What is this graph even telling me?
Why are there only 3 clusters showing up when I put 6 clusters as an input? (answered by first comment and updated code and graph)
How can I fine tune this graph to tell me more and show me a useful relationship if I don't know what the relationship I am looking for is?

Read up on the limitations of k-means.
In particular, be aware that
you must remove all identifier columns
k-means is very sensitive to scale. All attributes need to be carefully scaled according to their value range, distribution, and importance. Preprocessing is essential!
k-means assumes continuous variables. The use on categorical data, even when one-hot encoded, is questionable. It sometimes works "okayish" but barely ever workd "good".

According to your code, the X axis corresponds to the indices of your samples (seeing your graph, I suppose you have around 10 000 users then), and the Y axis corresponds to the labels of each sample.
You might not have 6 clusters as an input. Indeed, when you format your results as a dataframe, a labels variable is used, while it is actually labels2 that contain the computed cluster assignments. I don't know where your labels come from, but I suspect this is the reason you obtain those results. Hence, regarding question 2, this graph probably doesn't show anything relevant.
You first could use other visualisations to better understand how your data is being clustered. Sklearn's documentation provides many examples you could use for inspiration (1, 2, 3).
Hope it helped !

Scikit-learn kmeans clustering

I'm supposed to be doing a kmeans clustering implementation with some data. The example I looked at from http://glowingpython.blogspot.com/2012/04/k-means-clustering-with-scipy.html shows their test data in 2 columns... however, the data I'm given is 68 subjects with 78 features (so 68x78 matrix). How am I supposed to create an appropriate input for this?
I've basically just tried inputting the matrix anyway, but it doesn't seem to do what I want... and I don't know why it would. I'm pretty confused as to what to do.
data = np.rot90(data)
centroids,_ = kmeans(data,2)
# assign each sample to a cluster
idx,_ = vq(data,centroids)
# some plotting using numpy's logical indexing
plot(data[idx==0,0],data[idx==0,1],'ob',
data[idx==1,0],data[idx==1,1],'or')
plot(centroids[:,0],centroids[:,1],'sg',markersize=8)
show()
I honestly don't know what kind of code to show you.. the data format I told you was already described. Otherwise, it's the same as the tutorial I linked.

Your visualization only uses the first two dimensions.
That is why these points appear to be "incorrect" - they are closer in a different dimension.
Have a look at the next two dimensions:
plot(data[idx==0,2],data[idx==0,3],'ob',
data[idx==1,2],data[idx==1,3],'or')
plot(centroids[:,2],centroids[:,3],'sg',markersize=8)
show()
... repeat for all remaining of oyur 78 dimensions...
At this many features, (squared) Euclidean distance gets meaningless, and k-means results tend to become as good as random convex partitions.
To get a more representative view, consider using MDS to project the data into 2d for visualization. It should work reasonably fast with just 68 subjects.
Please include visualizations in your questions. We don't have your data.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.