Get the cluster size in sklearn in Python

I am using sklearn's DBSCAN to cluster my data as follows:
from sklearn.cluster import DBSCAN

# Apply DBSCAN (sims == my data as a list of lists)
db1 = DBSCAN(min_samples=1, metric='precomputed').fit(sims)
db1_labels = db1.labels_
db1n_clusters_ = len(set(db1_labels)) - (1 if -1 in db1_labels else 0)
# Returns the number of clusters (e.g., 10 clusters)
print('Estimated number of clusters: %d' % db1n_clusters_)
Now I want the top 3 clusters sorted by size (the number of data points in each cluster). How can I obtain the cluster sizes in sklearn?

Another option would be to use numpy.unique:
import numpy as np

db1_labels = db1.labels_
# count only non-noise labels (DBSCAN marks noise as -1)
labels, counts = np.unique(db1_labels[db1_labels >= 0], return_counts=True)
print(labels[np.argsort(-counts)[:3]])
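If you also want the sizes alongside the labels, the same argsort index works on counts; a small addition using the labels and counts arrays from above:
order = np.argsort(-counts)[:3]
# top 3 cluster labels with their sizes, largest first
for label, size in zip(labels[order], counts[order]):
    print(label, size)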

Well, you can use the bincount function in NumPy to get the frequencies of the labels. For example, using the DBSCAN example from scikit-learn:
import numpy as np

# Store the labels
labels = db.labels_
# Then get the frequency count of the non-negative labels
counts = np.bincount(labels[labels >= 0])
print(counts)
# Output : [243 244 245]
Then, to get the top values, use argsort in NumPy. Since there are only 3 clusters in our example, I will extract the top 2:
top_labels = np.argsort(-counts)[:2]
print(top_labels)
# Output : [2 1]
# To get their respective frequencies
print(counts[top_labels])
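If you prefer the result keyed by cluster label, pairing labels with their counts is a one-liner; a small addition using the counts array from above (bincount is already indexed by label):
# map each cluster label to its size
print(dict(enumerate(counts)))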

Related

How to plot clusters using k means with datasets of tweets?

I have a dataset containing tweets. After preprocessing the tweets, I tried clustering them:
# output the result to a text file.
clusters = df.groupby('cluster')
for cluster in clusters.groups:
    f = open('cluster' + str(cluster) + '.csv', 'w') # create csv file
    data = clusters.get_group(cluster)[['id', 'Tweets']] # get id and tweets columns
    f.write(data.to_csv(index_label='id')) # set index to id
    f.close()
print("Cluster centroids: \n")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(k):
    print("Cluster %d:" % i)
    for j in order_centroids[i, :10]: # print out 10 feature terms of each cluster
        print(' %s' % terms[j])
    print('------------')
Hence it is grouping my tweets into 6 clusters. How can I plot them in 2D as a whole?
First of all, if you're using the k-means clustering algorithm, you always specify the number of clusters manually, so you could simply set it to 2.
But if you've somehow decided (elbow method, silhouette score, or something else) that 6 clusters are better than two, you should apply some dimensionality reduction (sklearn's PCA, TSNE, etc.) to your features and then scatterplot them with point colors corresponding to the clusters. It will look like this:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# project the features down to 2 components for plotting
pca = PCA(n_components=2, svd_solver='full')
X_decomposed = pca.fit_transform(X)
# color each point by its cluster label
plt.scatter(X_decomposed[:, 0], X_decomposed[:, 1], c=clustering)
plt.show()
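A t-SNE projection (also mentioned above) works the same way; a minimal sketch, assuming the same X and cluster labels clustering as above:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# t-SNE often separates clusters more visibly than PCA, at the cost of speed
X_embedded = TSNE(n_components=2, random_state=1).fit_transform(X)
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=clustering)
plt.show()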

What is the meaning of normalization in machine learning? Does it correspond to one sample?

I am dealing with a classification problem: I want to classify data into 2 classes. I generate 1000 samples at different temperatures ranging from 1 to 5 and load the data using the function load_data below. Here "data" is a 2-dimensional array of shape (1000, 16): the rows are the 1000 samples in each file (e.g. "1.0.npy"), and 16 is the number of features. I picked the max and min values from each sample by applying a for loop. But I'm afraid my normalization is not correct, because I'm not sure what the normalization strategy in machine learning should be. Should I pick np.amax of each sample, or should I pick np.amax over all 1000 samples contained in the "1.0.npy" file? My goal is to normalize the data between 0 and 1.
import os
import numpy as np

def load_data():
    path = "./directory"
    files = sorted(os.listdir(path)) # {1.0.npy, 2.0.npy, ..., 5.0.npy}
    dictData = {}
    for df in sorted(files):
        print(df)
        data = np.load(os.path.join(path, df))
        a = data
        lis = []
        for i in range(len(data)):
            # min-max scale each sample (row) to [0, 1] using its own min/max
            old_range = np.amax(a[i]) - np.amin(a[i])
            new_range = 1 - 0
            f = ((a[i] - np.amin(a[i])) / old_range) * new_range + 0
            lis.append(f)
        dictData[df] = lis
    return dictData
After normalization I get the following result, such that the first value of every sample is 0 and the last value is 1.
[0, ...., 1] #first sample
[0,.....,1] #second sample
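For comparison, the per-feature variant (scaling each of the 16 columns by its min/max over all 1000 samples) is what sklearn's MinMaxScaler implements; a minimal sketch, assuming data is one loaded (1000, 16) array:
from sklearn.preprocessing import MinMaxScaler

# each feature (column) is scaled to [0, 1] using its own min/max
# computed over all samples, rather than per sample
data_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(data)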

Cluster features' median values using Python

While working on a dataset, I used k-means clustering and I want to explore the median values of the features/variables.
import pandas as pd
from sklearn.cluster import KMeans

data = pd.DataFrame({'Monetary': rfm_m_log, 'Recency': rfm_r_log, 'Frequency': rfm_f_log})
matrix = data.values # convert the DataFrame to a NumPy array
kmeans = KMeans(init='k-means++', n_clusters=2, n_init=30)
kmeans.fit(matrix)
clusters_customers = kmeans.predict(matrix)
How to print the median values of Monetary, Recency and Frequency in each cluster? (Cluster 1 and Cluster 2)
It can be done by slicing the DataFrame according to the predicted cluster assignments:
import numpy as np

# class 0 median of the Monetary column
data.iloc[np.argwhere(clusters_customers == 0).ravel()]['Monetary'].median()
# class 1 median of the Monetary column
data.iloc[np.argwhere(clusters_customers == 1).ravel()]['Monetary'].median()
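If you want the medians of all three features for both clusters at once, a pandas groupby does it in one step; a small sketch using the data frame and clusters_customers from above:
# median of every feature, per cluster
print(data.assign(cluster=clusters_customers).groupby('cluster').median())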

K-means clustering on 3 dimensions with sklearn

I'm trying to cluster data using lat/lon as X/Y axes and DaysUntilDueDate as my Z axis. I also want to retain the index column ('PM') so that I can create a schedule later using this clustering analysis. The tutorial I found here has been wonderful but I don't know if it's taking the Z-axis into account, and my poking around hasn't resulted in anything but errors. I think the essential point in the code is the parameters of the iloc bit of this line:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(A.iloc[:, :])
I tried changing this part to iloc[1:4] (to only work on columns 1-3) but that resulted in the following error:
ValueError: n_samples=3 should be >= n_clusters=4
So my question is: How can I set up my code to run clustering analysis on 3-dimensions while retaining the index ('PM') column?
Here's my python file, thanks for your help:
from sklearn.cluster import KMeans
import csv
import pandas as pd

# Import csv file with data in following columns:
# [PM (index)] [Longitude] [Latitude] [DaysUntilDueDate]
df = pd.read_csv('point_data_test.csv', index_col=['PM'])

numProjects = len(df)
K = numProjects // 3 # Around three projects can be worked per day

print("Number of projects: ", numProjects)
print("K-clusters: ", K)

for k in range(1, K):
    # Create a kmeans model on our data, using k clusters.
    # Random_state helps ensure that the algorithm returns the
    # same results each time.
    kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[:, :])

    # These are our fitted labels for clusters --
    # the first cluster has label 0, and the second has label 1.
    labels = kmeans_model.labels_

    # Sum of distances of samples to their closest cluster center
    SSE = kmeans_model.inertia_
    print("k:", k, " SSE:", SSE)

# Add labels to df
df['Labels'] = labels
#print(df)
df.to_csv('test_KMeans_out.csv')
It seems the issue is with the syntax of iloc[1:4].
From your question it appears you changed:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[:, :])
to:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[1:4])
It seems to me that either you have a typo or you don't understand how iloc works. So I will explain.
You should start by reading Indexing and Selecting Data from the pandas documentation.
But in short, .iloc is an integer-based indexing method for selecting data by position.
Let's say you have the dataframe:
    A   B   C
0   1   2   3
1   4   5   6
2   7   8   9
3  10  11  12
The use of iloc in the example you provided, iloc[:, :], selects all rows and columns and produces the entire dataframe. In case you aren't familiar with Python's slice notation, take a look at the question Explain slice notation or the docs for An Informal Introduction to Python. The example you said caused your error, iloc[1:4], selects the rows at index 1-3. This would result in:
    A   B   C
1   4   5   6
2   7   8   9
3  10  11  12
Now, if you think about what you are trying to do and the error you received, you will realize that you have selected fewer samples from your data than the number of clusters you are looking for: 3 samples (rows at index 1, 2, and 3), but you're telling KMeans to find 4 clusters, which just isn't possible.
What you really intended to do (as I understand it) was to select all rows and columns 1-3 that correspond to your lat, lng, and z values. To do this just add a colon as the first argument to iloc like so:
df.iloc[:, 1:4]
Now you will have selected all of your samples and the columns at index 1, 2, and 3. Assuming you have enough samples, KMeans should work as you intended.
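Applied to the loop above, only the slice in the fit line changes; a sketch of the corrected call (note that if read_csv's index_col=['PM'] already made PM the index, the frame holds just the three feature columns, and df.iloc[:, 0:3] or plain df would be the equivalent slice):
# cluster on the Longitude, Latitude, and DaysUntilDueDate columns;
# the PM index is untouched and stays available for the schedule
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[:, 1:4])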

Determining the size of cluster after Kmeans in Python

So I have successfully found the optimal number of clusters for the k-means algorithm in Python, but how can I find the exact size of each cluster after applying k-means?
Here's a code snippet
import math
import numpy as np
from scipy import cluster
from scipy.cluster.vq import kmeans, vq

# pair up the two feature arrays as rows of a 2-column matrix
data = np.vstack(list(zip(simpleassetid_arr, simpleuidarr)))
centroids, _ = kmeans(data, round(math.sqrt(len(uidarr) / 2)))
idx, _ = vq(data, centroids)

initial = [cluster.vq.kmeans(data, i) for i in range(1, 10)]
var = [var for (cent, var) in initial] # to determine the optimal k using the elbow test
num_k = int(input("Enter the number of clusters: "))
cent, var = initial[num_k - 1]
assignment, cdist = cluster.vq.vq(data, cent)
You can get the cluster size using this:
print(np.bincount(idx))
For the example below, np.bincount(idx) outputs an array of two elements, e.g. [156 144].
from numpy import vstack,array
import numpy as np
from numpy.random import rand
from scipy.cluster.vq import kmeans,vq
# data generation
data = vstack((rand(150,2) + array([.5,.5]),rand(150,2)))
# computing K-Means with K = 2 (2 clusters)
centroids,_ = kmeans(data,2)
# assign each sample to a cluster
idx,_ = vq(data,centroids)
#Print number of elements per cluster
print(np.bincount(idx))
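The same idea carries over to sklearn, whose KMeans exposes the assignments as labels_; a minimal sketch reusing the generated data from above:
from sklearn.cluster import KMeans

km = KMeans(n_clusters=2, random_state=1).fit(data)
# cluster sizes, indexed by cluster label
print(np.bincount(km.labels_))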
