Apply KNN from small supervised dataset to large unsupervised dataset in Python

I have trained and tested a KNN model on a small supervised dataset of about 200 samples in Python. I would like to apply these results to a much larger unsupervised dataset of several thousand samples.
My question is: is there a way to fit the KNN model using the small supervised dataset, and then change the K-value for the large unsupervised dataset? I do not want to overfit the model by using the low K-value from the smaller dataset but am unsure how to fit the model and then change the K-value in Python.
Is this possible using KNN? Is there some other way to apply KNN to a much larger unsupervised dataset?

I would recommend actually fitting a KNN model on your larger dataset a couple of different times, each time using a different value for k. For each of those models you could then calculate the Silhouette Score.
Compare the silhouette scores and choose, as your final value of k (number of clusters), the value used for your highest-scoring model.
As an example, here's some code I used to do this for myself last year:
import numpy as np
from sklearn import mixture
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

## A list of the different numbers of clusters (the 'n_components' parameter) with
## which we will run GMM.
number_of_clusters = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]

## Graph plotting method
def makePlot(number_of_clusters, silhouette_scores):
    # Plot each value of 'number of clusters' vs. the silhouette score at that value
    fig, ax = plt.subplots(figsize=(16, 6))
    ax.set_xlabel('GMM - number of clusters')
    ax.set_ylabel('Silhouette Score (higher is better)')
    ax.plot(number_of_clusters, silhouette_scores)

    # Ticks and grid
    xticks = np.arange(min(number_of_clusters), max(number_of_clusters) + 1, 1.0)
    ax.set_xticks(xticks, minor=False)
    ax.set_xticks(xticks, minor=True)
    ax.xaxis.grid(True, which='both')
    yticks = np.arange(round(min(silhouette_scores), 2), max(silhouette_scores), .02)
    ax.set_yticks(yticks, minor=False)
    ax.set_yticks(yticks, minor=True)
    ax.yaxis.grid(True, which='both')

## Graph the mean silhouette score of each cluster amount.
## Print out the number of clusters that results in the highest
## silhouette score for GMM.
def findBestClusterer(number_of_clusters):
    silhouette_scores = []
    for i in number_of_clusters:
        clusterer = mixture.GaussianMixture(n_components=i)  # Use the model of your choice here
        clusterer.fit(<your data set>)                       # enter your data set's variable name here
        preds = clusterer.predict(<your data set>)
        score = silhouette_score(<your data set>, preds)
        silhouette_scores.append(score)

    ## Print a table of all the silhouette scores; pad the cluster count so the
    ## table stays aligned whether it has one or two digits.
    print("")
    print("| Number of clusters | Silhouette score |")
    print("| ------------------ | ---------------- |")
    for i in range(len(number_of_clusters)):
        print("| {number:<18} | {score:<16.4f} |".format(number=number_of_clusters[i],
                                                         score=silhouette_scores[i]))

    ## Graph the plot of silhouette scores for each amount of clusters
    makePlot(number_of_clusters, silhouette_scores)

    ## Find and print out the cluster amount that gives the highest
    ## silhouette score.
    best_silhouette_score = max(silhouette_scores)
    index_of_best_score = silhouette_scores.index(best_silhouette_score)
    ideal_number_of_clusters = number_of_clusters[index_of_best_score]
    print("")
    print("Having {} clusters gives the highest silhouette score of {}.".format(ideal_number_of_clusters,
                                                                                round(best_silhouette_score, 4)))

findBestClusterer(number_of_clusters)
Please note that in my example, I used a GMM model instead of KNN, but you should be able to slightly modify the findBestClusterer() method to use whatever clustering algorithm you wish. In this method you will also specify your dataset.
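For instance, here is a minimal sketch of that modification using scikit-learn's KMeans instead of a Gaussian mixture (X is a placeholder name for your feature matrix):
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def find_best_kmeans(X, cluster_counts):
    # Fit KMeans for each candidate cluster count and score the resulting labels.
    scores = []
    for k in cluster_counts:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores.append(silhouette_score(X, labels))
    # Return the cluster count whose model scored highest, plus all the scores.
    best_k = cluster_counts[scores.index(max(scores))]
    return best_k, scores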

In machine learning there are two broad types of learners: eager learners (decision trees, neural nets, SVMs, ...) and lazy learners such as KNN. In fact, KNN doesn't do any learning at all. It just stores the labeled data you give it and then uses it at inference time: it computes how similar a new (unlabeled) sample is to all of the stored labeled samples, and infers the new sample's class/value from a majority vote of its K nearest instances (the K nearest neighbours, hence the name).
Now, to get to your question: "training" the KNN has nothing to do with K itself, so when performing inference feel free to use whatever K gives the best result for you.
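For example, with scikit-learn's KNeighborsClassifier you can fit once on the small labeled set and then adjust K before predicting on the large unlabeled set. A minimal sketch, where X_small, y_small, and X_large are placeholder names:
from sklearn.neighbors import KNeighborsClassifier

# Fitting a KNN classifier essentially just stores the labeled samples.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_small, y_small)

# Change K at inference time without refitting, then label the large dataset.
knn.set_params(n_neighbors=7)
labels_large = knn.predict(X_large)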

Related

What is considered to be a good silhouette score?

I am currently doing some clustering based on word embeddings, and I am using some methods (elbow and Davies-Bouldin) to determine the optimal number of clusters I should consider. In addition, I consider the silhouette measure. If I understood it correctly, it is a measure of how well the data match their assigned clusters, ranging from -1 (mismatch) to 1 (correct match).
Using k-means clustering, I obtain a silhouette score oscillating between 0.5 and 0.55. So according to the silhouette, the elbow method (which is a bit too smooth, but that might be because I have a lot of data) and the Davies-Bouldin index, I should consider 5 clusters. However, I don't know if 0.5 can be considered a good score. I added the graphs of the different measures I made, the function I used to generate them (found online), as well as the clustering obtained.
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score
import matplotlib.pyplot as plt

SEED = 10  # seed for reproducibility

def check_clustering(X, K):
    sse, db, slc = {}, {}, {}
    for k in range(2, K):
        kmeans = KMeans(n_clusters=k, max_iter=1000, random_state=SEED).fit(X)
        if k == 3:
            labels = kmeans.labels_
        clusters = kmeans.labels_
        sse[k] = kmeans.inertia_  # Inertia: sum of squared distances of samples to their closest cluster center
        db[k] = davies_bouldin_score(X, clusters)
        slc[k] = silhouette_score(X, clusters)

    plt.figure(figsize=(15, 10))
    plt.plot(list(sse.keys()), list(sse.values()))
    plt.xlabel("Number of clusters")
    plt.ylabel("SSE")
    plt.show()

    plt.figure(figsize=(15, 10))
    plt.plot(list(db.keys()), list(db.values()))
    plt.xlabel("Number of clusters")
    plt.ylabel("Davies-Bouldin values")
    plt.show()

    plt.figure(figsize=(15, 10))
    plt.plot(list(slc.keys()), list(slc.values()))
    plt.xlabel("Number of clusters")
    plt.ylabel("Silhouette score")
    plt.show()
I am quite new to k-means clustering and mainly followed online tutorials. Can somebody tell me if the scores obtained through the different measures (but mostly silhouette's) seem correct?
Thank you for your answer.
Also, there is a subsidiary question: I find the shape of the clusters a bit weird (I would expect them to be more fragmented). Is this a plausible cluster shape? (Note that I used PCA to reduce the dimensions, so it might be because of that.)
Thank you for your help.
Just searched this myself.
A silhouette score close to 1 means each data point is unlikely to be assigned to another cluster.
A score close to 0 means each data point could easily be assigned to another cluster.
A score close to -1 means the data point has likely been assigned to the wrong cluster.
Based on these assumptions, I'd say 0.55 is still informative though not definitive and therefore you would need additional analysis to make any assertions based on your data.

Determining the optimal k for k-means for a given dataset using Python

I am pretty new to Python and clustering. Right now I have a task to analyze a dataset and determine the optimal number of k-means clusters using the elbow and silhouette methods.
As shown in the picture, my dataset has three features: the weight of the tested person, the person's blood cholesterol content, and the person's gender ('0' means female, '1' means male).
I first used the elbow method to look at the WCSS value at different k values:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(data)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
And got the plot below:
Then, I used the silhouette method to look at the silhouette score:
from sklearn.metrics import silhouette_score

sil = []
for k in range(2, 6):
    kmeans = KMeans(n_clusters=k)
    preds = kmeans.fit_predict(data)
    sil.append(silhouette_score(data, preds, metric='euclidean'))

plt.plot(range(2, 6), sil)
plt.title('Silhouette Method')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.show()

for i in range(len(sil)):
    print(str(i + 2) + ": " + str(sil[i]))
And I got the following results:
Could anybody suggest how I can pick the optimal k? From some light research, some sources say the higher the silhouette score the better (in my case the number of clusters should then be 2?), but in other cases people do not simply use the cluster count with the highest score.
Another thought: since I included gender as one of the features, should I first split the data into two groups by gender and then cluster each group separately?
The k-means algorithm is very sensitive to the range in which your features are measured. In your case gender is a binary variable that only takes the values 0 and 1, but the other two features are measured on a much larger scale. I recommend normalizing your data first and then redoing the plots, which should produce more consistent results between your elbow curve and the silhouette method.
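As a sketch of that normalization step (assuming scikit-learn's StandardScaler and the same data variable used above):
from sklearn.preprocessing import StandardScaler

# Scale weight and cholesterol to zero mean and unit variance so no single
# feature dominates the Euclidean distances used by k-means.
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Re-run the elbow and silhouette loops on data_scaled instead of data.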

How to distinguish real outliers with PyOD?

I am working on an anomaly detection project on call detail records for a telephone operator. I have prepared a sample of 10000 observations with 80 dimensions, which represents all of the observations for one day of traffic. The data are represented as follows:
This is a small part of the whole dataset.
I decided to use the PyOD library, which offers many unsupervised learning algorithms, and to start with KNN:
from pyod.models.knn import KNN
knn= KNN(contamination= 0.1)
result = knn.fit_predict(conso)
Then, to visualize the result, I reduced the sample to 2 dimensions and displayed it as a scatter plot, with the observations KNN predicted as inliers in blue and the predicted outliers in red.
from sklearn.manifold import TSNE
import pandas as pd

result_f = TSNE(n_components=2).fit_transform(df_final_2)
result_f = pd.DataFrame(result_f)
color = ['red' if row == 1 else 'blue' for row in result]
'df_final_2' is the dataframe version of 'conso'.
Then I plotted everything with those colors:
import matplotlib.pyplot as plt
plt.scatter(result_f[0],result_f[1], s=1, c=color)
What disturbs me in the graph is that the observations predicted as outliers are not really outliers: normally the outliers should sit at the extremities of the graph, not grouped with the normal behaviors. Even when analyzing these supposedly aberrant observations, they show normal behavior in the original dataset. I have tried other PyOD algorithms and modified the parameters of each algorithm, but I obtain much the same result. I must have made a mistake somewhere, but I cannot pin it down.
Thanks.
There are several things to check:
When using KNN, LOF, and similar models that rely on distance measures, the data should first be standardized (using sklearn's StandardScaler); see the sketch after the snippet below.
t-SNE may not work well in this case, and the dimensionality reduction could be off.
Maybe do not use fit_predict, but do this instead (use y_train_pred):
# train kNN detector
clf_name = 'KNN'
clf = KNN(contamination=0.1)
clf.fit(X)
# get the prediction labels and outlier scores of the training data
y_train_pred = clf.labels_ # binary labels (0: inliers, 1: outliers)
y_train_scores = clf.decision_scores_ # raw outlier scores
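For the first point, a minimal standardization sketch (assuming the feature matrix is the conso DataFrame from the question):
from sklearn.preprocessing import StandardScaler
from pyod.models.knn import KNN

# Standardize the features so the distance-based detector is not dominated
# by the columns with the largest numeric ranges.
X = StandardScaler().fit_transform(conso)

clf = KNN(contamination=0.1)
clf.fit(X)
outlier_labels = clf.labels_  # 0: inlier, 1: outlier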
If none of these work, feel free to open an issue report on GitHub and we will take a further investigation.

Weighted k-means in python

After reading this post here about duplicate values in k-means clustering, I realized I cannot simply use unique points for clustering.
https://stats.stackexchange.com/questions/152808/do-i-need-to-remove-duplicate-objects-for-cluster-analysis-of-objects
I have over 10000000 points, though only 8000 unique ones. Therefore, I initially thought that to speed things up I'd use the unique points only. It seems this is a bad idea.
To keep computational time down, this post suggests adding weights to each point. How can this be implemented in Python?
Using the KMeans class from the scikit-learn library, clustering is performed here with the number of clusters set to 11.
The array Y contains the sample weights, whereas X contains the actual points that need to be clustered.
from sklearn.cluster import KMeans  # For applying KMeans
import pandas as pd
##--------------------------------------------------------------------------------------------------------##
# Starting k-means clustering
kmeans = KMeans(n_clusters=11, n_init=10, random_state=0, max_iter=1000)

# Run k-means clustering with the 'X' array as the input coordinates and the
# 'Y' array as sample weights
wt_kmeansclus = kmeans.fit(X, sample_weight=Y)
predicted_kmeans = kmeans.predict(X, sample_weight=Y)

# Store the results obtained together with the respective city-state labels
kmeans_results = pd.DataFrame({"label": data_label, "kmeans_cluster": predicted_kmeans + 1})

# Print the count of points allotted to each cluster and then the cluster centers
print(kmeans_results.kmeans_cluster.value_counts())
print(kmeans.cluster_centers_)
I think the post suggests working with a weighted average.
You can create a new dataset out of the old one, where each point has an extra attribute: its frequency (i.e. its weight).
Every time you calculate the new centroid for each cluster, take the weighted average of all points of that cluster (instead of calculating the simple mean of all points).
PS: Manipulating the dataset is dangerous. I'd parallelize the code if computational cost is a major factor.
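A minimal NumPy sketch of that weighted centroid update (points, weights, and labels are hypothetical arrays for a single iteration):
import numpy as np

def weighted_centroids(points, weights, labels, n_clusters):
    # Recompute each cluster centroid as the weighted mean of its assigned points.
    centroids = np.zeros((n_clusters, points.shape[1]))
    for c in range(n_clusters):
        mask = labels == c
        # np.average with frequency weights gives the weighted mean per feature.
        centroids[c] = np.average(points[mask], axis=0, weights=weights[mask])
    return centroids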

scikit feature importance selection experiences

Scikit-learn has a mechanism to rank features (for classification) using extremely randomized trees.
forest = ExtraTreesClassifier(n_estimators=250,
                              compute_importances=True,
                              random_state=0)
My question is whether this method does a "univariate" or "multivariate" feature ranking. The univariate case is where individual features are compared to each other. I would appreciate some clarification here. Are there any other parameters I should try to fiddle with? Any experiences and pitfalls with this ranking method are also appreciated.
The output of this ranking identifies feature numbers (5, 20, 7, ...). I would like to check whether the feature number really corresponds to the row in the feature matrix. That is, does feature number 5 correspond to the sixth row in the feature matrix (counting from 0)?
I'm not an expert, but this is not univariate. In fact, the total feature importance is computed from the feature importance of each tree (taking the mean value, I think).
For each tree, the importances are computed from the impurity reduction at its splits.
I used this method and it seems to give good results, better in my view than univariate methods. But I don't know of any technique to validate the results other than knowledge of the dataset.
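To illustrate that aggregation, here is a small sketch (assuming a fitted ExtraTreesClassifier named forest, as above):
import numpy as np

# The forest-level importances are the per-tree impurity-based importances
# averaged over all trees in the ensemble.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
mean_importances = per_tree.mean(axis=0)

# This should match the attribute exposed by scikit-learn (up to normalization).
print(np.allclose(mean_importances, forest.feature_importances_))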
To order the features correctly, you should follow this example and modify it a bit, like so, to use a pandas.DataFrame and its proper column names:
import numpy as np
import pandas
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier

X = pandas.DataFrame(...)
y = pandas.Series(...)

# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=0)
forest.fit(X, y)

feature_importance = forest.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)[::-1]

print("Feature importance:")
i = 1
for f, w in zip(X.columns[sorted_idx], feature_importance[sorted_idx]):
    print("%d) %s : %d" % (i, f, w))
    i += 1

pos = np.arange(sorted_idx.shape[0]) + .5
plt.subplot(1, 2, 2)
nb_to_display = 30
plt.barh(pos[:nb_to_display], feature_importance[sorted_idx][:nb_to_display], align='center')
plt.yticks(pos[:nb_to_display], X.columns[sorted_idx][:nb_to_display])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()
