I am using some data to generate labels so that I can sort my data for use in a supervised learning environment. I have been generating a dendrogram to visualize how the data clusters, but when I use KMeans to create the labels, only a few of them show up in the cluster displayed in the dendrogram.
code:
import pandas as pd
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, dendrogram

# Load the preprocessed feature array
combined_array = pd.read_pickle('arrays.pickle')
# K-means with 7 clusters and a fixed random state for reproducibility
model = KMeans(algorithm = 'auto', copy_x = True, init = 'k-means++', max_iter = 300,
               n_clusters = 7, n_init = 10, n_jobs = 1, precompute_distances = 'auto',
               random_state = 1, tol = 0.0001, verbose = 0)
model.fit(combined_array)
labels = model.predict(combined_array)
pd.DataFrame(labels).to_csv("arrays_labels.csv")
# Hierarchical (Ward) clustering for the dendrogram
mergings = linkage(combined_array, method = 'ward')
dendrogram(mergings, leaf_rotation = 0, leaf_font_size = 14, show_contracted = True)
The image above shows a section of which files should be in that cluster, but when I use KMeans to generate labels, only files 28, 33, 41, 45, and 70 are included. So why aren't 13, 42, 67, and 81 showing up in my labels? Do KMeans and the dendrogram create different types of clustering?
I can't quite connect your code to what you are asking, but yes, they are totally different!
A dendrogram is produced by hierarchical clustering, which is very simple and DETERMINISTIC (apply it twice and you will get the same result).
It works in this way:
1) Compute the distance between points (or clusters)
2) Select the minimum distance
3) Merge the 2 points (or clusters) with the minimum distance into one cluster
4) Go to 1 until you get 1 cluster containing all elements
There are a lot of details omitted but this is the core.
As you can see, it is based on the distance between points and does not tell you which cluster configuration is best; there are separate techniques to select the number of clusters.
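If you want flat labels that agree with the dendrogram, a minimal sketch is to cut the same Ward linkage into a fixed number of clusters (assuming the mergings and pd objects from your code):

from scipy.cluster.hierarchy import fcluster

# Cut the Ward linkage into 7 flat clusters (the same count you passed to KMeans)
hier_labels = fcluster(mergings, t = 7, criterion = 'maxclust')
pd.DataFrame(hier_labels).to_csv("arrays_hier_labels.csv")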
K-means needs to know the number of clusters you are looking for in advance (note that you specify n_clusters in your code).
It works like this (see the sketch after these steps):
1) Randomly initialize n centroids (a centroid is the center of mass of a cluster)
2) Assign each point to its closest centroid
3) Re-compute center of mass of the clusters created
4) Go to 2 until convergence
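A minimal NumPy sketch of that loop, for illustration only (this is not how scikit-learn implements it, and it skips convergence checks and empty-cluster handling):

import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1) Randomly pick k points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2) Assign each point to its closest centroid
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        # 3) Re-compute the center of mass of each cluster
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids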
So, if I'm right, what you are trying to do is generate labels with a clustering algorithm and then fit a supervised model on them.
So what you are looking for is simply clustering model selection.
There are many techniques for selecting the best number of clusters and the best algorithm, and they depend heavily on your problem and your data (have a good look at the scikit-learn documentation before doing any kind of clustering).
If you want a general approach, try to look at this library which can select the best results among the ones you provide.
PS: An approach that generally works well is silhouette analysis.
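A minimal sketch of silhouette analysis for choosing n_clusters (assuming the combined_array from the question):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Try several cluster counts and keep the one with the highest silhouette score
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=1)
    labels = km.fit_predict(combined_array)
    print(k, silhouette_score(combined_array, labels))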
I have a dataset where every data sample consists of 10-20 2D coordinate points. The data is mostly clean, but occasionally there are falsely annotated points. For illustration, the cleanly annotated data would look like this:
either clustered in a small area or spread across a larger area. The outliers I'm trying to filter out look like this:
the outlier lies away from the "correct" cluster.
I tried z-score filtering, but this approach falsely marked many annotations as outliers:
# Per-coordinate z-scores (the small constant avoids division by zero)
std_score = np.abs((points - points.mean(axis=0)) / (np.std(points, axis=0) + 0.01))
# Keep points whose scores fall below the 95th percentile in both dimensions
validity = np.all(std_score <= np.quantile(std_score, 0.95, axis=0), axis=1)
Is there a method designed to solve this problem?
This seems like a typical clustering problem, and if the data looks the way you suggested, KMeans from scikit-learn should do the trick. Let's look at how we can do this.
First, I am generating a data sample that might look somewhat like your data.
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)  # For reproducibility

# Two synthetic clusters: a tight one and a looser one
cluster_1 = np.random.normal(loc = [1,1], scale = [0.2,0.2], size = (20,2))
cluster_2 = np.random.normal(loc = [2,1], scale = [0.4,0.4], size = (5,2))

plt.scatter(cluster_1[:,0], cluster_1[:,1])
plt.scatter(cluster_2[:,0], cluster_2[:,1])
plt.show()

points = np.vstack([cluster_1, cluster_2])
This is what the data will look like.
Next, we will do KMeans clustering.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 2).fit(points)
We choose n_clusters as 2, believing that there are 2 clusters in the dataset. After finding these clusters, let's look at them.
plt.scatter(points[kmeans.labels_==0][:,0], points[kmeans.labels_==0][:,1], label='cluster_1')
plt.scatter(points[kmeans.labels_==1][:,0], points[kmeans.labels_==1][:,1], label ='cluster_2')
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], label = 'cluster_center')
plt.legend()
plt.show()
This will look like the image shown below.
This should solve your problem, but there are some things to keep in mind:
It will not be perfect every time.
It might be a problem if a sample doesn't have any outliers; this can be checked with silhouette scores.
It is difficult to know which cluster to discard. This can be done by locating the centers of the clusters (the green points) or by finding the cluster with the smaller number of points, as sketched below.
Endnote: You might lose some points, but you would automate the entire process. It depends on how much you want to trade off in terms of data saved versus manual time saved.
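A minimal sketch of the second option, keeping only the larger cluster (assuming the kmeans and points objects from the code above):

import numpy as np

# Count points per cluster and keep only the members of the largest one
counts = np.bincount(kmeans.labels_)
keep_label = np.argmax(counts)
clean_points = points[kmeans.labels_ == keep_label]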
I want to use the BIC criterion to find the optimal number of clusters for GMM clustering. I plotted the BIC scores for cluster numbers 2 to 41 and got the attached curve. I have no idea how to interpret this; can someone help?
For reference, this is the code I used to do GMM clustering. It is applied to daily wind vector data over a region, totaling approximately 5,500 columns and 13,880 rows.
import numpy as np
import pandas
from sklearn.mixture import GaussianMixture

def gmm_clusters(df_std, dates):
    ks = range(2, 44, 3)
    bic_scores = []
    csv_files = []
    for k in ks:
        model = GaussianMixture(n_components=k,
                                n_init=1,
                                init_params='random',
                                covariance_type='full',
                                verbose=0,
                                random_state=123)
        fitted_model = model.fit(df_std)
        bic_score = fitted_model.bic(df_std)
        bic_scores.append(bic_score)
        labels = fitted_model.predict(df_std)
        print("Labels counts")
        print(np.bincount(labels))
        df_label = pandas.DataFrame(df_std)
        print("############ dataframe AFTER CLUSTERING ###############")
        df_dates = pandas.DataFrame(dates)
        df_dates.columns = ['Date']
        df_dates = df_dates.reset_index(drop=True)
        df_label = df_label.join(df_dates)
        df_label["Cluster"] = labels
        print(df_label)
        csv_file = "{0}_GMM_2_Countries_850hPa.csv".format(k)
        df_label.to_csv(csv_file)
        csv_files.append(csv_file)
    return ks, bic_scores, csv_files
Thank you!!
EDIT:
Using K-means on the same data, I get this elbow plot (plot of SSE):
This is fairly clear to interpret, indicating that 11 clusters is the optimum.
The first thing that springs to mind is to check the numbers of clusters below 10 with a step of 1, not 3. Maybe there is a dip in the BIC that you are missing there.
The second thing is to compare AIC and BIC. See here: https://stats.stackexchange.com/questions/577/is-there-any-reason-to-prefer-the-aic-or-bic-over-the-other
The third thing is that your dataset has 5,500 dimensions but only 13,880 points. That is fewer than 3 points per dimension, so I would be surprised to find any clustering at all (which is what the BIC chart is indicating). You'd need to tell us more about the data, what each column means, and what kind of clustering you are looking for.
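A minimal sketch of the first two suggestions, scanning small cluster counts with a step of 1 and recording AIC alongside BIC (assuming the df_std array from your function):

from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt

aic_scores, bic_scores = [], []
ks = range(2, 12)  # finer scan over the small cluster counts
for k in ks:
    gmm = GaussianMixture(n_components=k, covariance_type='full',
                          random_state=123).fit(df_std)
    aic_scores.append(gmm.aic(df_std))
    bic_scores.append(gmm.bic(df_std))

# Plot both criteria and look for a minimum or a clear elbow
plt.plot(ks, aic_scores, label='AIC')
plt.plot(ks, bic_scores, label='BIC')
plt.xlabel('n_components')
plt.legend()
plt.show()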
Hi, I have computed the mean of the vectors and used DBSCAN to cluster them. However, I am unsure how I should plot the results, since my data does not have an [x, y, z, ...] format.
sample dataset:
mean_vec = [[2.2771908044815063],
[3.0691280364990234],
[2.7700443267822266],
[2.6123080253601074],
[2.6043469309806824],
[2.6386525630950928],
[2.7034034729003906],
[2.3540258407592773]]
I have used the code below (from scikit-learn) to obtain my clusters:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(mean_vec)
db = DBSCAN(eps = 0.15, min_samples = 5).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
Is it possible to plot my clusters? The plotting example from scikit-learn is not working for me. The scikit-learn link can be found here.
On one-dimensional data, use kernel density estimation rather than DBSCAN. It is much better supported by theory and much better understood. One can see DBSCAN as a fast approximation of KDE for the multivariate case.
Anyway, plotting one-dimensional data is not that hard. For example, you can plot a histogram.
Also, the clusters will necessarily correspond to intervals, so you can plot lines for the (min, max) of each cluster.
You can even abuse 2D scatter plots: simply use the label as the y value.
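A minimal sketch of the histogram and label-as-y ideas, reusing the X and labels variables from your DBSCAN code:

import numpy as np
import matplotlib.pyplot as plt

# Histogram of the one-dimensional values
plt.hist(X.ravel(), bins=20)
plt.show()

# Scatter plot with the cluster label as the y value (-1 is noise)
plt.scatter(X.ravel(), labels, c=labels)
plt.yticks(np.unique(labels))
plt.show()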
I am faced with the following array:
y = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
What I would like to do is extract the cluster with the highest scores. That would be
best_cluster = [200,297,275,243]
I have checked quite a few questions on Stack Overflow on this topic, and most of them recommend using k-means, although a few others mention that k-means might be overkill for clustering 1D arrays.
However, k-means is a supervised learning algorithm, which means I would have to pass in the number of centroids. As I need to generalize this problem to other arrays, I cannot pass the number of centroids for each of them. Therefore, I am looking at implementing some sort of unsupervised learning algorithm that would be able to figure out the clusters by itself and select the highest one.
In array y I would see 3 clusters: [1,2,4,7,9,5,4,7,9], [56,57,54,60], and [200,297,275,243].
What algorithm would best fit my needs, considering computational cost and accuracy, and how could I implement it for my problem?
Try MeanShift. From the scikit-learn user guide on MeanShift:
The algorithm automatically sets the number of clusters, ...
Modified demo code:
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
# #############################################################################
# Generate sample data
X = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
X = np.reshape(X, (-1, 1))
# #############################################################################
# Compute clustering with MeanShift
# The following bandwidth can be automatically detected using
# bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=100)
ms = MeanShift(bandwidth=None, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)
print("number of estimated clusters : %d" % n_clusters_)
print(labels)
Output:
number of estimated clusters : 2
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1]
Note that MeanShift is not scalable with the number of samples. The recommended upper limit is 10,000.
BTW, as rahlf23 already mentioned, k-means is an unsupervised learning algorithm. The fact that you have to specify the number of clusters does not mean it is supervised.
See also:
Overview of clustering methods
Choosing the right estimator
Clustering is overkill here
Just compute the differences between subsequent elements, i.e. look at x[i] - x[i-1].
Choose the k largest differences as split points, or define a threshold on when to split, e.g. 20. That depends on your knowledge of the data.
This is O(n), much faster than all the others mentioned, and also very understandable and predictable.
On one-dimensional ordered data, any method that doesn't use the order will be slower than necessary.
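A minimal sketch of that idea, splitting the sorted data at the k-1 largest gaps (assuming the array y from the question and that the cluster with the highest values is wanted):

import numpy as np

y = np.sort([1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243])
k = 3  # desired number of clusters

# Gaps between consecutive sorted values; split at the k-1 largest gaps
gaps = np.diff(y)
split_points = np.sort(np.argsort(gaps)[-(k - 1):]) + 1
clusters = np.split(y, split_points)

best_cluster = clusters[-1]  # the cluster with the highest values
print(best_cluster)          # [200 243 275 297]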
HDBSCAN is the best clustering algorithm and you should always use it.
Basically all you need to do is provide a reasonable min_cluster_size, a valid distance metric and you're good to go.
For min_cluster_size I suggest using 3, since a cluster of 2 is lame, and for metric the default euclidean works great, so you don't even need to mention it.
Don't forget that distance metrics apply to vectors and here we have scalars, so some ugly reshaping is in order.
To put it all together, and assuming that by "cluster with the highest scores" you mean the cluster that includes the max value, we get:
from hdbscan import HDBSCAN
import numpy as np

y = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
y = np.reshape(y, (-1, 1))  # distance metrics expect vectors, so reshape to (n, 1)

clusterer = HDBSCAN(min_cluster_size=3)
cluster_labels = clusterer.fit_predict(y)

# Exemplar points of the cluster that contains the maximum value
best_cluster = clusterer.exemplars_[cluster_labels[y.argmax()]].ravel()
print(best_cluster)
The output is [297 200 275 243]. Original order is not preserved. C'est la vie.
After reading this post here about duplicate values in k-means clustering, I realized I cannot simply use unique points for clustering.
https://stats.stackexchange.com/questions/152808/do-i-need-to-remove-duplicate-objects-for-cluster-analysis-of-objects
I have over 10,000,000 points, though only 8,000 unique ones. Therefore, I initially thought that, to speed things up, I'd use unique points only. It seems like this is a bad idea.
To keep computation time down, that post suggests adding weights to each point. How can this be implemented in Python?
Using the KMeans class from the scikit-learn library, clustering is performed here with the number of clusters set to 11.
The array Y contains the weights, whereas X has the actual points that need to be clustered.
import pandas as pd
from sklearn.cluster import KMeans  # For applying KMeans
##--------------------------------------------------------------------------------------------------------##
# Starting k-means clustering
kmeans = KMeans(n_clusters=11, n_init=10, random_state=0, max_iter=1000)

# Running k-means clustering with the 'X' array as the input coordinates
# and the 'Y' array as sample weights
wt_kmeansclus = kmeans.fit(X, sample_weight = Y)
predicted_kmeans = kmeans.predict(X, sample_weight = Y)

# Storing results obtained together with respective city-state labels
kmeans_results = pd.DataFrame({"label": data_label,
                               "kmeans_cluster": predicted_kmeans + 1})

# Printing count of points allotted to each cluster
print(kmeans_results.kmeans_cluster.value_counts())
I think the post suggests working with a weighted average.
You can create a new dataset out of the old one, and the new dataset will have an extra attribute for each point: its frequency (i.e. its weight).
Every time you calculate the new centroid for each cluster, take the weighted average of all points of that cluster (instead of calculating the simple mean of all points).
PS: Manipulating the dataset is dangerous. I'd parallelize the code if computational cost is a major factor.
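A minimal sketch of that weighted centroid update (the names unique_points, counts, and labels are assumptions: the unique rows, their frequencies, and a current cluster assignment; in scikit-learn the same effect comes from the sample_weight argument shown above):

import numpy as np

def weighted_centroid(cluster_points, cluster_weights):
    # Weighted average: each unique point counts as many times as it occurs
    return np.average(cluster_points, axis=0, weights=cluster_weights)

def update_centroids(unique_points, counts, labels, k):
    # Recompute the centroid of every cluster using the point frequencies as weights
    return np.array([weighted_centroid(unique_points[labels == j], counts[labels == j])
                     for j in range(k)])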