I'm currently using K-means clustering on text data (marketing activity descriptions) vectorized by tf-idf, have an elbow-informed optimal k, have made a scatterplot using PCA, and have added a column with cluster labels to my data frame (all in Python). So in one sense I can interpret my clustering model by reviewing the labeled text data.
However, I would like to also be able to extract N most frequent words from each of the clusters.
First I'm reading in the data and getting an optimal k via elbow:
# import pandas to use dataframes and handle tabular data, e.g. the labeled text dataset for clustering
import pandas as pd
# read in the data using panda's "read_csv" function
col_list = ["DOC_ID", "TEXT", "CODE"]
data = pd.read_csv('/Users/williammarcellino/Downloads/AEMO_Sample.csv', usecols=col_list, encoding='latin-1')
# use a regular expression to clean annoying "\n" newline characters
data = data.replace(r'\n', ' ', regex=True)
#import sklearn for TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# vectorize the text in the df and fit the TEXT data. Builds a vocabulary (a python dict) to map the most frequent
# words to feature indices and compute word occurrence frequencies (sparse matrix). Word frequencies are then
# reweighted using the Inverse Document Frequency (IDF) vector collected feature-wise over the corpus.
# note: stop_words must be the string 'english' (not a set) to use the built-in English stop word list
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(data.TEXT)
#use elbow method to determine optimal "K"
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
Sum_of_squared_distances = []
# try a range of K values and record the sum of squared distances (inertia) for each, to find the elbow
K = range(6,16)
for k in K:
    km = KMeans(n_clusters=k, max_iter=200, n_init=10)
    km.fit(X)
    Sum_of_squared_distances.append(km.inertia_)
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
Based on that, I build a model at k=9:
# optimal "K" value from the elbow plot above
true_k = 9
# define an unsupervised clustering "model" using KMeans
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=300, n_init=10)
#fit model to data
model.fit(X)
# define cluster labels (integers; a human needs to make them interpretable)
labels = model.labels_
title=[data.DOC_ID]
# make a "clustered" copy of the dataframe (use .copy() so the original dataframe is not modified in place)
data_cl = data.copy()
# add the label values as a new column, "Cluster"
data_cl['Cluster'] = labels
# I used this to look at my output on a small sample; remove for large datasets in actual analyses
print(data_cl)
# output our new, clustered dataframe to a csv file
data_cl.to_csv('/Users/me/Downloads/AEMO_Sample_clustered.csv')
Finally I plot the principal components:
import numpy as np
from sklearn.decomposition import PCA
# reuse the labels from the model fitted above (calling fit_predict again would refit and could renumber the clusters)
model_indices = model.labels_
pca = PCA(n_components=2)
scatter_plot_points = pca.fit_transform(X.toarray())
colors = ["r", "b", "c", "y", "m", "paleturquoise", "g", 'aquamarine', 'tab:orange']
x_axis = [o[0] for o in scatter_plot_points]
y_axis = [o[1] for o in scatter_plot_points]
fig, ax = plt.subplots(figsize=(20,10))
ax.scatter(x_axis, y_axis, c=[colors[d] for d in model_indices])
for i, txt in enumerate(labels):
    ax.annotate(txt, (x_axis[i] + .005, y_axis[i]), size=10)
Any help extracting and plotting the top terms from each cluster would be greatly appreciated. Thanks.
I was able to answer my question by using code found here.
def get_top_features_cluster(tf_idf_array, prediction, n_feats):
    labels = np.unique(prediction)
    dfs = []
    for label in labels:
        id_temp = np.where(prediction == label)               # indices of the documents in this cluster
        x_means = np.mean(tf_idf_array[id_temp], axis=0)      # average tf-idf score of each term across the cluster
        sorted_means = np.argsort(x_means)[::-1][:n_feats]    # indices of the n_feats highest-scoring terms
        features = vectorizer.get_feature_names()             # use get_feature_names_out() on newer scikit-learn versions
        best_features = [(features[i], x_means[i]) for i in sorted_means]
        df = pd.DataFrame(best_features, columns=['features', 'score'])
        dfs.append(df)
    return dfs
dfs = get_top_features_cluster(tf_idf_array, prediction, 15)
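Note that tf_idf_array and prediction are not defined above. A minimal sketch of how to build them, assuming X is the tf-idf matrix and model is the fitted KMeans from the question:
tf_idf_array = X.toarray()    # dense tf-idf matrix, one row per document
prediction = model.labels_    # cluster assignment for each document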
This code did not work for me, so I did something like this instead:
vectorizer = TfidfVectorizer(stop_words=stopwords)
X = vectorizer.fit_transform(dfi['text'][~dfi['text'].isna()])
print('How many clusters do you want to use?')
true_k = int(input())
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=200, n_init=10)
model.fit(X)
labels=model.labels_
clusters=pd.DataFrame(list(zip(dfi['text'][~dfi['text'].isna()],labels)),columns=['title','cluster'])
features = vectorizer.get_feature_names()  # use get_feature_names_out() on newer scikit-learn versions
n_feats=15
for i in range(true_k):
    cclust = X[clusters['cluster'] == i]                    # tf-idf rows for the documents in cluster i
    meanWts = cclust.A.mean(axis=0)                         # mean tf-idf weight of each term across the cluster
    sorted_mean_ix = np.argsort(meanWts)[::-1][:n_feats]    # indices of the top 15 scores
    # get the most important feature names:
    print(np.array(features)[sorted_mean_ix])
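The original question also asked about plotting the top terms. A minimal sketch that reuses the variables from the loop above (X, clusters, features, true_k and n_feats are assumed to be defined as in the preceding code):
import numpy as np
import matplotlib.pyplot as plt
fig, axes = plt.subplots(true_k, 1, figsize=(8, 3 * true_k))
for i, ax in enumerate(axes):
    cclust = X[clusters['cluster'] == i]
    meanWts = cclust.A.mean(axis=0)
    sorted_mean_ix = np.argsort(meanWts)[::-1][:n_feats]
    # horizontal bars, highest-weighted term at the top
    ax.barh(np.array(features)[sorted_mean_ix][::-1], meanWts[sorted_mean_ix][::-1])
    ax.set_title('Cluster %d' % i)
plt.tight_layout()
plt.show()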
Related
I have a dataset containing almost 28K users and almost 7K features.
Here's what the dataframe looks like.
I have applied K-Means clustering; here's the code I have used:
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
scaler = MinMaxScaler()
data_rescaled = scaler.fit_transform(df3)
scaled_df = pd.DataFrame(data_rescaled, index=df3.index, columns=df3.columns)
from sklearn.decomposition import PCA
pca = PCA(n_components = 3)
pca.fit(scaled_df)
reduced = pca.transform(scaled_df)
kmeanModel = KMeans(n_clusters=100 , random_state = 0)
label = kmeanModel.fit_predict(reduced)
sse = kmeanModel.inertia_
How do I visualize this as a clusters-vs-users histogram, with clusters on the x-axis and the number of users on the y-axis, so I can see how many users fall into each cluster?
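A minimal sketch of one way to do this, assuming label holds the per-user cluster assignments returned by fit_predict above:
import pandas as pd
import matplotlib.pyplot as plt
# count how many users fall into each cluster and plot the counts as a bar chart
pd.Series(label).value_counts().sort_index().plot(kind='bar', figsize=(14, 4))
plt.xlabel('Cluster')
plt.ylabel('Number of users')
plt.show()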
I have not clustered data in a while, and at the moment I have a massive list of accounts with their respective areas (or OUs in the table below).
I have used kmeans and kmodes to try to cluster based on OU, meaning that I want the output to group the 17 OUs I have and cluster them based on the provided information. Thus far the output has given me clustering based on each record individually and not based on each OU. Can someone help me figure out how to group the output and then cluster somehow? Below is a sample of the code used.
# Building the model with 3 clusters
from kmodes.kmodes import KModes
kmode = KModes(n_clusters=3, init="random", n_init=5, verbose=1)
clusters = kmode.fit_predict(df)
clusters
#insert the predicted cluster values in our original dataset.
df.insert(0, "Cluster", clusters, True)
df.head(10)
I don't have access to your data set, but below is a generic example of how to do clustering.
# Cluster analysis, or clustering, is an unsupervised machine learning task.
# It involves automatically discovering natural grouping in data. Unlike supervised learning (like predictive modeling),
# clustering algorithms only interpret the input data and find natural groups or clusters in feature space.
import statsmodels.api as sm
import numpy as np
import pandas as pd
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df_cars = pd.DataFrame(mtcars)
df_cars.head()
from sklearn.cluster import KMeans
from matplotlib import pyplot
# define dataset
X = df_cars[['mpg','hp']]
# define the model
model = KMeans(n_clusters=8)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
X['kmeans']=yhat
pyplot.scatter(X['mpg'], X['hp'], c=X['kmeans'], cmap='rainbow', s=50, alpha=0.8)
See the link below for more details.
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20Algorithms%20Compared.ipynb
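Coming back to the OU question: if the goal is to cluster the 17 OUs rather than the individual accounts, one option is to aggregate the account-level records per OU first and then cluster the aggregated frame. A rough sketch, assuming df has an 'OU' column and the remaining columns are categorical (as KModes expects):
from kmodes.kmodes import KModes
# collapse each OU to its most frequent value in every column, giving one row per OU
ou_profiles = df.groupby('OU').agg(lambda s: s.mode().iloc[0])
kmode = KModes(n_clusters=3, init="random", n_init=5, verbose=1)
ou_clusters = kmode.fit_predict(ou_profiles)
print(pd.DataFrame({'OU': ou_profiles.index, 'Cluster': ou_clusters}))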
I have a dataset containing tweets; after preprocessing the tweets I tried clustering them:
# output the result to a text file.
clusters = df.groupby('cluster')
for cluster in clusters.groups:
    f = open('cluster' + str(cluster) + '.csv', 'w')        # create a csv file per cluster
    data = clusters.get_group(cluster)[['id', 'Tweets']]    # get the id and Tweets columns
    f.write(data.to_csv(index_label='id'))                  # set the index label to id
    f.close()
print("Cluster centroids: \n")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(k):
    print("Cluster %d:" % i)
    for j in order_centroids[i, :10]:  # print the top 10 feature terms of each cluster
        print(' %s' % terms[j])
    print('------------')
Hence it is grouping my tweets into 6 clusters. How can I plot them in 2D as a whole?
First of all, if you're using the KMeans clustering algorithm, you always specify the number of clusters manually, so you could simply set it to 2.
But if you have somehow (elbow method, silhouette score, or something else) decided that 6 clusters are better than two, you should apply some dimensionality reduction (sklearn's PCA, TSNE, etc.) to your features and then scatterplot them with point colors corresponding to the clusters. It will look something like this:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
pca = PCA(n_components=2, svd_solver='full')
X_decomposed = pca.fit_transform(X)
# color each point by its cluster label (clustering holds the label assigned to each sample)
plt.scatter(X_decomposed[:, 0], X_decomposed[:, 1], c=clustering)
I have a KMeans function I made. It takes the input def kmeans(x, k, no_of_iterations): and returns points, centroids. It gets plotted perfectly; the code for that isn't very relevant. But I want to calculate the silhouette score for it and graph this for each value.
#Load Data
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
data = load_digits().data
pca = PCA(2)
#Transform the data
df = pca.fit_transform(data)
X= df
#y = kmeans.fit_predict(X)
#Applying our function
label, centroids = kmeans(df,10,1000)#returns points value and centroids
y = label.fit_predict(data)
#Visualize the results
u_labels = np.unique(label)
for i in u_labels:
    plt.scatter(df[label == i, 0], df[label == i, 1], label=i)
plt.scatter(centroids[:,0] , centroids[:,1] , s = 80, color = 'k')
plt.legend()
plt.show()
The above is the code for running the KMeans plot.
Below is my attempt to calculate the silhouette score. It is from an example that uses KMeans, but I don't really want to do that, nor did it work with my code.
from sklearn.metrics import silhouette_score, silhouette_samples
silhouette_avg = silhouette_score(X, y)
print("The average silhouette_score is :", silhouette_avg)
# Compute the silhouette scores for each sample
sample_silhouette_values = silhouette_samples(X, y)
You may notice that there is no value here for y; I have found that y is supposed to be the number of clusters, I think? So I had it as 10 at first and it gave an error message. Could anyone tell me from this code what I should do next to get this value?
Try this:
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
mpl.rcParams["figure.figsize"] = (9,6)
# Generate synthetic dataset with 8 blobs
X, y = make_blobs(n_samples=1000, n_features=12, centers=8, shuffle=True, random_state=42)
# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12))
visualizer.fit(X) # Fit the data to the visualizer
visualizer.poof()    # poof() was renamed show() in newer yellowbrick versions
# Instantiate the clustering model and visualizer
model = KMeans(8)
visualizer = SilhouetteVisualizer(model)
visualizer.fit(X) # Fit the data to the visualizer
visualizer.poof() # Draw/show/poof the data
Also, see this.
https://www.scikit-yb.org/en/latest/api/cluster/silhouette.html
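To answer the question about y directly: silhouette_score expects the cluster label assigned to each sample, not the number of clusters. A minimal sketch using the variables from the question above (df and label as returned by the custom kmeans function):
from sklearn.metrics import silhouette_score, silhouette_samples
# y must be the per-sample cluster labels, not the number of clusters
silhouette_avg = silhouette_score(df, label)
print("The average silhouette_score is :", silhouette_avg)
# silhouette score of each individual sample
sample_silhouette_values = silhouette_samples(df, label)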
I'm trying to reduce my components to 2 instead of 64 but I keep getting this error:
"Length mismatch: Expected axis has 64 elements, new values have 4 elements"
Why is the PCA I'm running on the data set not changing the number to 2?
This is what I have:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
import sklearn.metrics as sm
import pandas as pd
import numpy as np
import scipy
from sklearn import decomposition
digits = datasets.load_digits() #load the digits dataset instead of the iris dataset
x = pd.DataFrame(digits.data) #was(iris.data)
x.columns = ['Sepal_L', 'Sepal_W', 'Sepal_L', 'Sepal_W']
plt.cla()
pca = decomposition.PCA(n_components=2)
pca.fit(x)
x = pca.transform(x)
y = pd.DataFrame(digits.target)
y.columns = ['Targets']
# this line actually builds the machine learning model and runs the algorithm
# on the dataset
model = KMeans(n_clusters = 10) #Run k-means on this datatset to cluster the data into 10 classes
model.fit(x)
#print(model.labels_)
colormap = np.array(['red', 'blue', 'yellow', 'black'])
# Plot the Models Classifications
plt.subplot(1, 2, 2)
plt.scatter(x.Petal_L, x.Petal_W, c=colormap[model.labels_], s=40)
plt.title('K Means Classification')
plt.show()
It's not actually the PCA that is problematic, but just the renaming of your columns: the digits dataset has 64 columns, and you are trying to name the columns according to the column names for the 4 columns in the iris dataset.
Because of the nature of the digits dataset (pixels), there isn't really an appropriate naming scheme for the columns. So just don't rename them.
digits = datasets.load_digits()
x = pd.DataFrame(digits.data)
pca = decomposition.PCA(n_components=2)
pca.fit(x)
x = pca.transform(x)
# Here is the result of your PCA (2 components)
>>> x
array([[ -1.25946636, 21.27488332],
[ 7.95761139, -20.76869904],
[ 6.99192268, -9.9559863 ],
...,
[ 10.80128366, -6.96025224],
[ -4.87210049, 12.42395326],
[ -0.34438966, 6.36554934]])
Then you can plot the first principal component against the second, if that's what you're going for (which is what I gathered from your code):
plt.scatter(x[:,0], x[:,1], s=40)
plt.show()
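If you also want the points colored by their KMeans cluster, a minimal sketch continuing from the code above (the model is refit on the 2-component x):
model = KMeans(n_clusters=10)
model.fit(x)
# color each point by its cluster assignment
plt.scatter(x[:, 0], x[:, 1], c=model.labels_, cmap='tab10', s=40)
plt.title('K Means Classification')
plt.show()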