I have a dataset containing tweets. After preprocessing the tweets, I tried clustering them:
# output the result to a text file.
clusters = df.groupby('cluster')
for cluster in clusters.groups:
    f = open('cluster' + str(cluster) + '.csv', 'w')  # create csv file
    data = clusters.get_group(cluster)[['id', 'Tweets']]  # get id and tweets columns
    f.write(data.to_csv(index_label='id'))  # set index to id
    f.close()

print("Cluster centroids: \n")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(k):
    print("Cluster %d:" % i)
    for j in order_centroids[i, :10]:  # print out 10 feature terms of each cluster
        print(' %s' % terms[j])
    print('------------')
It groups my tweets into 6 clusters. How can I plot them all in 2D?
First of all, if you're using the KMeans clustering algorithm, you always specify the number of clusters manually, so you could simply set it to 2.
But if you've somehow decided (elbow method, silhouette score, or something else) that 6 clusters are better than two, you should apply some dimensionality reduction (sklearn's PCA, t-SNE, etc.) to your features and then scatterplot them with point colors corresponding to the clusters. That will look something like this:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2, svd_solver='full')
X_decomposed = pca.fit_transform(X)
plt.scatter(X_decomposed[:, 0], X_decomposed[:, 1], c=clustering)
plt.show()
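Note that sklearn's PCA does not accept a sparse tf-idf matrix directly, so you would either densify it with X.toarray() first or use TruncatedSVD, which works on sparse input. A minimal sketch, assuming the X and clustering variables from the snippet above:
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt

# TruncatedSVD reduces the sparse tf-idf matrix to 2 components without densifying it
svd = TruncatedSVD(n_components=2)
X_decomposed = svd.fit_transform(X)          # shape (n_samples, 2)
plt.scatter(X_decomposed[:, 0], X_decomposed[:, 1], c=clustering)
plt.show()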
I'm currently using K-means clustering on text data (marketing activity descriptions) vectorized by tf-idf. I have an elbow-informed optimal k, have made a scatterplot using PCA, and have added a column with cluster labels to my data frame (all in Python), so in one sense I can interpret my clustering model by reviewing the labeled text data.
However, I would also like to be able to extract the N most frequent words from each of the clusters.
First I'm reading in the data and getting an optimal k via elbow:
# import pandas to use dataframes and handle tabular data, e.g. the labeled text dataset for clustering
import pandas as pd

# read in the data using pandas' "read_csv" function
col_list = ["DOC_ID", "TEXT", "CODE"]
data = pd.read_csv('/Users/williammarcellino/Downloads/AEMO_Sample.csv', usecols=col_list, encoding='latin-1')

# use a regular expression to clean annoying "\n" newline characters
data = data.replace(r'\n', ' ', regex=True)

# import sklearn's TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# vectorize text in the df and fit the TEXT data. Builds a vocabulary (a python dict) to map the most frequent words
# to feature indices and compute word occurrence frequency (sparse matrix). Word frequencies are then reweighted
# using the Inverse Document Frequency (IDF) vector collected feature-wise over the corpus.
# note: stop_words must be the string 'english', not the set {'english'}
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(data.TEXT)
# use the elbow method to determine the optimal "K"
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

Sum_of_squared_distances = []
# we'll try a range of K values and record the sum of squared distances to the nearest centroid (inertia) for each fit
K = range(6, 16)
for k in K:
    km = KMeans(n_clusters=k, max_iter=200, n_init=10)
    km = km.fit(X)
    Sum_of_squared_distances.append(km.inertia_)

plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
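As a cross-check on the elbow plot, the silhouette score can be computed over the same range of k values. This is a small optional sketch, assuming the X matrix from above (higher scores are better):
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

# fit KMeans for each candidate k and report the mean silhouette score
for k in range(6, 16):
    km = KMeans(n_clusters=k, max_iter=200, n_init=10).fit(X)
    print(k, silhouette_score(X, km.labels_))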
Based on that, I build a model at k=9:
# optimal "K" value from elobow plot above
true_k = 9
# define an unsupervised clustering "model" using KMeans
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=300, n_init=10)
#fit model to data
model.fit(X)
# define cluster labels (which are integers--a human needs to make them interpretable)
labels=model.labels_
title=[data.DOC_ID]
#make a "clustered" version of the dataframe
data_cl=data
# add label values as a new column, "Cluster"
data_cl['Cluster'] = labels
# I used this to look at my output on a small sample; remove for large datasets in actual analyses
print(data_cl)
# output our new, clustered dataframe to a csv file
data_cl.to_csv('/Users/me/Downloads/AEMO_Sample_clustered.csv')
Finally, I plot the principal components:
import numpy as np
from sklearn.decomposition import PCA

model_indices = model.fit_predict(X)
pca = PCA(n_components=2)
scatter_plot_points = pca.fit_transform(X.toarray())

colors = ["r", "b", "c", "y", "m", "paleturquoise", "g", 'aquamarine', 'tab:orange']
x_axis = [o[0] for o in scatter_plot_points]
y_axis = [o[1] for o in scatter_plot_points]

fig, ax = plt.subplots(figsize=(20, 10))
ax.scatter(x_axis, y_axis, c=[colors[d] for d in model_indices])
for i, txt in enumerate(labels):
    ax.annotate(txt, (x_axis[i] + .005, y_axis[i]), size=10)
plt.show()
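As a small alternative to hand-picking nine colors (not part of the original code), the cluster labels can be passed straight to c with a categorical colormap, which also makes adding a legend easy:
fig, ax = plt.subplots(figsize=(20, 10))
# color each point by its integer cluster label using a categorical colormap
scatter = ax.scatter(x_axis, y_axis, c=model_indices, cmap='tab10', s=10)
ax.legend(*scatter.legend_elements(), title='cluster')
plt.show()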
Any help extracting and plotting the top terms from each cluster would be much appreciated. Thanks.
I was able to answer my question by using code found here.
def get_top_features_cluster(tf_idf_array, prediction, n_feats):
    labels = np.unique(prediction)
    dfs = []
    for label in labels:
        id_temp = np.where(prediction == label)  # indices for each cluster
        x_means = np.mean(tf_idf_array[id_temp], axis=0)  # average tf-idf score across the cluster
        sorted_means = np.argsort(x_means)[::-1][:n_feats]  # indices of the top n_feats scores
        features = vectorizer.get_feature_names()
        best_features = [(features[i], x_means[i]) for i in sorted_means]
        df = pd.DataFrame(best_features, columns=['features', 'score'])
        dfs.append(df)
    return dfs

dfs = get_top_features_cluster(tf_idf_array, prediction, 15)
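For completeness, here is a minimal sketch of how the two inputs can be built from the earlier snippet; the assignments below assume the fitted vectorizer output X and the fitted KMeans model from above and are not part of the original answer:
import numpy as np

tf_idf_array = X.toarray()     # dense tf-idf matrix, shape (n_docs, n_terms)
prediction = model.predict(X)  # cluster id for each document

dfs = get_top_features_cluster(tf_idf_array, prediction, 15)
print(dfs[0])                  # top 15 terms of cluster 0 with their mean scores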
This code was not working for me, so I did something like this instead:
vectorizer = TfidfVectorizer(stop_words=stopwords)
X = vectorizer.fit_transform(dfi['text'][~dfi['text'].isna()])

print('How many clusters do you want to use?')
true_k = int(input())

model = KMeans(n_clusters=true_k, init='k-means++', max_iter=200, n_init=10)
model.fit(X)
labels = model.labels_
clusters = pd.DataFrame(list(zip(dfi['text'][~dfi['text'].isna()], labels)), columns=['title', 'cluster'])

features = vectorizer.get_feature_names()
n_feats = 15
for i in range(true_k):
    cclust = X[clusters['cluster'] == i]
    meanWts = cclust.A.mean(axis=0)
    sorted_mean_ix = np.argsort(meanWts)[::-1][:n_feats]  # indices with top 15 scores
    # get most important feature names:
    print(np.array(features)[sorted_mean_ix])
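If you also want to plot the top terms (the second half of the question), here is a minimal sketch that draws one horizontal bar chart per cluster, assuming the features, X, clusters, and true_k variables from the snippet above:
import numpy as np
import matplotlib.pyplot as plt

n_feats = 15
fig, axes = plt.subplots(1, true_k, figsize=(4 * true_k, 5))
for i, ax in enumerate(np.atleast_1d(axes)):
    cclust = X[(clusters['cluster'] == i).values]        # rows belonging to cluster i
    mean_wts = np.asarray(cclust.mean(axis=0)).ravel()   # mean tf-idf weight per term
    top_ix = np.argsort(mean_wts)[::-1][:n_feats]        # indices of the top terms
    ax.barh(np.array(features)[top_ix][::-1], mean_wts[top_ix][::-1])
    ax.set_title('Cluster %d' % i)
plt.tight_layout()
plt.show()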
I have a map of data:
import seaborn as sns
import matplotlib.pyplot as plt
X = 101_by_99_float32_array
ax = sns.heatmap(X, square = True)
plt.show()
Note these data are essentially a 3D surface, and I'm interested in the index positions in X after clustering. I can easily apply the kmeans algorithm to my data:
from sklearn.cluster import KMeans
# three clusters is arbitrary; just used for testing purposes
k_means = KMeans(init='k-means++', n_clusters=3, n_init=10).fit(X)
But I am not sure how to use the kmeans output to identify which cluster each pixel in the map above belongs to. What I want to do is make a map that looks like the one above, but instead of plotting the z-value for each cell in the 101x99 array X, I'd like to plot the cluster number for each cell in X.
I don't know if this is possible with the output of the kmeans algorithm, but I did try an approach from the scikit-learn documentation here:
import numpy as np

k_means_labels = k_means.labels_
k_means_cluster_centers = k_means.cluster_centers_
k_means_labels_unique = np.unique(k_means_labels)

colors = ['#4EACC5', '#FF9C34', '#4E9A06']
plt.figure()
for k, col in zip(range(3), colors):
    my_members = k_means_labels == k
    cluster_center = k_means_cluster_centers[k]
    plt.plot(X[my_members, 0], X[my_members, 1], 'w',
             markerfacecolor=col, marker='.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=6)
plt.title('KMeans')
plt.show()
But it's clear this is not accessing the information I want...
It's obvious I do not fully understand what each component of the kmeans output represents, and I've tried to read the explanations in the answer to the question found here. However, there's nothing in that answer that explicitly addresses whether the indices of the original data are preserved after clustering, which is really the core of my question. If such information is implicitly present in the kmeans output through some matrix multiplication, I could really use some help extracting it.
Thank you for your time and assistance!
EDIT:
Thanks to @Nakor, for both the explanation about kmeans and the suggestion to reshape my data. How kmeans interprets my data is now much clearer. I should not expect it to capture the indices of each sample, but instead rely on reshape to do so. reshape will ravel the original (101, 99) matrix into a (9999, 1) array which, as @Nakor pointed out, is suitable for clustering every entry as an individual sample.
Simply reapplying reshape to kmeans.labels_ with the original shape of the data gives the result I was looking for:
Y = X.reshape(-1, 1)  # reshape the data to cluster each individual entry
kmeans = KMeans(init='k-means++', n_clusters=3, n_init=10)
kmeans.fit(Y)
Z = kmeans.labels_
A = Z.reshape(101, 99)  # reshape the labels back to the original map shape

plt.figure()
ax = sns.heatmap(cu_map, square=True)  # original data
plt.figure()
ay = sns.heatmap(A, square=True)       # per-pixel cluster assignments
Your issue is that sklearn.cluster.KMeans expects a 2D matrix of shape [n_samples, n_features]. However, you provide the raw image, so sklearn understands that you have 101 samples with 99 features each (each row of your image is a sample, and the columns are the features). As a result, what you get in k_means.labels_ is the cluster assignment of each of the rows.
If you want to cluster every single entry instead, you need to reshape your data, for instance like this:
model = KMeans(init='k-means++', n_clusters=3, n_init=10)
model.fit(X.reshape(-1,1))
If I check with randomly generated data, I get:
In [1]: len(model.labels_)
Out[1]: 9999
I have one label per entry.
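To turn those per-entry labels back into a map with the original image shape (essentially what the edit above shows), they can simply be reshaped back:
import seaborn as sns
import matplotlib.pyplot as plt

# labels come back as a flat array of length 9999; restore the (101, 99) layout
cluster_map = model.labels_.reshape(X.shape)
sns.heatmap(cluster_map, square=True)
plt.show()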
I have this plot
Now I want to add a trend line to it. How do I do that?
The data looks like this:
I wanted to just plot how the median listing price in California has gone up over the years so I did this:
# Get California data
state_ca = []
state_median_price = []
state_ca_month = []
for state, price, date in zip(data['ZipName'], data['Median Listing Price'], data['Month']):
    if ", CA" not in state:
        continue
    else:
        state_ca.append(state)
        state_median_price.append(price)
        state_ca_month.append(date)
Then I converted the string state_ca_month to datetime:
# Convert state_ca_month to datetime
from datetime import datetime
state_ca_month = [datetime.strptime(x, '%m/%d/%Y %H:%M') for x in state_ca_month]
Then I plotted it:
# Plot trends
plt.figure(num=None, figsize=(12, 6), dpi=80, facecolor='w', edgecolor='k')
plt.plot(state_ca_month, state_median_price)
plt.show()
I thought of adding a trendline or some type of line, but I am new to visualization. If anyone has other suggestions, I would appreciate them.
Following the advice in the comments I get this scatter plot
I am wondering if I should further format the data to make a clearer plot to examine.
If by "trend line" you mean a literal line, then you probably want to fit a linear regression to your data. sklearn provides this functionality in python.
From the example hyperlinked above:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
% mean_squared_error(diabetes_y_test, diabetes_y_pred))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
To clarify, "the overall trend" is not a well-defined thing. Many times, by "trend", people mean a literal line that "fits" the data well. By "fits the data", in turn, we mean "predicts the data." Thus, the most common way to get a trend line is to pick a line that best predicts the data that you have observed. As it turns out, we even need to be clear about what we mean by "predicts". One way to do this (and a very common one) is by defining "best predicts" in such a way as to minimize the sum of the squares of all of the errors between the "trend line" and the observed data. This is called ordinary least squares linear regression, and is one of the simplest ways to obtain a "trend line". This is the algorithm implemented in sklearn.linear_model.LinearRegression.
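Applied to the data in the question, a minimal sketch could look like this (assuming the state_ca_month and state_median_price lists built above; converting the dates to ordinal numbers is just one way to get a numeric x variable):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# dates converted to ordinals so they can be used as a numeric feature
x = np.array([d.toordinal() for d in state_ca_month]).reshape(-1, 1)
y = np.array(state_median_price)

reg = LinearRegression().fit(x, y)
trend = reg.predict(x)

plt.figure(figsize=(12, 6))
plt.scatter(state_ca_month, y, s=5)
plt.plot(state_ca_month, trend, color='red', linewidth=2, label='trend line')
plt.legend()
plt.show()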
I am using sklearn DBSCAN to cluster my data as follows.
#Apply DBSCAN (sims == my data as list of lists)
db1 = DBSCAN(min_samples=1, metric='precomputed').fit(sims)
db1_labels = db1.labels_
db1n_clusters_ = len(set(db1_labels)) - (1 if -1 in db1_labels else 0)
#Returns the number of clusters (E.g., 10 clusters)
print('Estimated number of clusters: %d' % db1n_clusters_)
Now I want to get the top 3 clusters sorted by size (the number of data points in each cluster). How can I obtain the cluster sizes in sklearn?
Another option would be to use numpy.unique:
import numpy as np

db1_labels = db1.labels_
labels, counts = np.unique(db1_labels[db1_labels >= 0], return_counts=True)
print(labels[np.argsort(-counts)[:3]])
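To also see how many points each of those three clusters contains (a small addition, not part of the original answer):
top3 = np.argsort(-counts)[:3]
print(counts[top3])  # sizes of the three largest clusters, biggest first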
Well, you can use the bincount function in NumPy to get the frequencies of labels. For example, using the DBSCAN example from scikit-learn:
import numpy as np

# store the labels
labels = db.labels_

# then get the frequency count of the non-negative labels
counts = np.bincount(labels[labels >= 0])
print(counts)
# Output: [243 244 245]
Then, to get the top 3 values, use argsort in numpy. In our example, since there are only 3 clusters, I will extract the top 2 values:
top_labels = np.argsort(-counts)[:2]
print(top_labels)
# Output: [2 1]

# to get their respective frequencies
print(counts[top_labels])
So I have successfully found the optimal number of clusters for the k-means algorithm in Python, but how can I find out the exact size of each cluster after applying k-means?
Here's a code snippet:
import math
import numpy as np
from scipy.cluster.vq import kmeans, vq

data = np.vstack(list(zip(simpleassetid_arr, simpleuidarr)))
centroids, _ = kmeans(data, round(math.sqrt(len(uidarr) / 2)))
idx, _ = vq(data, centroids)

initial = [kmeans(data, i) for i in range(1, 10)]
var = [var for (cent, var) in initial]  # to determine the optimal number of k using the elbow test

num_k = int(input("Enter the number of clusters: "))
cent, var = initial[num_k - 1]
assignment, cdist = vq(data, cent)
You can get the cluster size using this:
print(np.bincount(idx))
For the example below, np.bincount(idx) outputs an array of two elements, e.g. [156 144]:
from numpy import vstack, array
import numpy as np
from numpy.random import rand
from scipy.cluster.vq import kmeans, vq

# data generation
data = vstack((rand(150, 2) + array([.5, .5]), rand(150, 2)))

# computing K-Means with K = 2 (2 clusters)
centroids, _ = kmeans(data, 2)

# assign each sample to a cluster
idx, _ = vq(data, centroids)

# print the number of elements per cluster
print(np.bincount(idx))