Tag based co-occurrence image clustering - python

I labeled lots of object images using Google Vision API. Using those labels (list in pickle here), I created a label co-occurrence matrix (download as numpy array here). Size of the matrix is 2195x2195.
Loading the data:
import pickle
import numpy as np
with open('labels.pkl', 'rb') as f:
    labels = pickle.load(f)
cooccurrence = np.load('cooccurrence.npy')
I would like to use cluster analysis to obtain a reasonable number of clusters (each defined as a list of Vision labels) that represent kinds of objects (e.g. cars, shoes, books, ...). I do not know what the right number of clusters is.
I tried the hierarchical clustering algorithm available in scikit-learn:
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 1000)
# creating a non-symmetrical "similarity" matrix:
occurrences = cooccurrence.diagonal().copy()
similarities = cooccurrence / occurrences[:,None]
#clustering:
from sklearn.cluster import AgglomerativeClustering
clusters = AgglomerativeClustering(n_clusters=200, affinity='euclidean', linkage='ward').fit_predict(similarities)
#results in pandas:
df_clusters = pd.DataFrame({'cluster': clusters.tolist(), 'label': labels})
df_clusters_grouped = df_clusters.groupby(['cluster']).agg({'label': [len, list]})
df_clusters_grouped.columns = [' '.join(col).strip() for col in df_clusters_grouped.columns.values]
df_clusters_grouped.rename(columns = {'label len': 'cluster_size', 'label list': 'cluster_labels'}, inplace=True)
df_clusters_grouped.sort_values(by=['cluster_size'], ascending=False)
Like this, I was able to create 200 clusters; one of them looks like this:
["Racket", "Racquet sport", "Tennis racket", "Rackets", "Tennis", "Racketlon", "Tennis racket accessory", "Strings"]
This somewhat works, but I would rather use a soft clustering method that can assign one label to multiple clusters (for instance, "leather" might make sense for both shoes and wallets). Also, I had to define the number of clusters (200 in my example code), which is something I would rather get as a result (if possible).
I also experimented with hdbscan, k-clique and Gaussian mixture models, but I did not come up with any better output.

Clustering methods such as sklearn's AgglomerativeClustering expect a data matrix as input. With metric="precomputed" you can also pass a distance matrix (but not for k-means and Gaussian mixture modeling; these do need coordinate data).
You, however, have a co-occurrence (similarity) matrix. Its values have the opposite meaning of distances: a large value means "close". So you'll have to identify an appropriate transformation into distances (for example occurrences-cooccurrences). Treating the co-occurrence matrix as a data matrix (and then using Euclidean distance, which is what you do) works to some extent but has very weird semantics and is not recommended.
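As a rough sketch of such a transformation, the snippet below turns a toy co-occurrence matrix into a Jaccard-style distance and clusters on the precomputed distances with scipy. The toy matrix, the Jaccard formula, the average linkage and the 0.7 cut-off are all assumptions chosen for illustration, not something prescribed above; other transformations may suit the real data better.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy symmetric co-occurrence matrix for 4 hypothetical labels.
# The diagonal holds how often each label occurs on its own.
cooccurrence = np.array([
    [10, 8, 1, 0],
    [8, 9, 0, 1],
    [1, 0, 12, 9],
    [0, 1, 9, 11],
], dtype=float)

occurrences = cooccurrence.diagonal()

# Jaccard-style similarity: |A and B| / |A or B| (an assumed choice).
union = occurrences[:, None] + occurrences[None, :] - cooccurrence
distances = 1.0 - cooccurrence / union
np.fill_diagonal(distances, 0.0)  # guard against rounding on the diagonal

# Hierarchical clustering directly on the distances (no coordinates needed).
Z = linkage(squareform(distances, checks=False), method="average")

# Cutting by distance means the number of clusters follows from the data.
cluster_ids = fcluster(Z, t=0.7, criterion="distance")
print(cluster_ids)
```

Cutting the tree at a distance threshold rather than at a fixed count is one way to avoid choosing the number of clusters up front, although the threshold itself then needs tuning.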

Related

Detect cluster outliers

I have a dataset where every data sample consists of 10-20 2D coordinate points. The data is mostly clean, but occasionally there are falsely annotated points. For illustration, the cleanly annotated data would look like these:
either clustered in a small area or spread across a larger area. The outliers I'm trying to filter out look like this:
the outlier is away from the "correct" cluster.
I tried z-score filtering, but this approach falsely marked many annotations as outliers:
std_score = np.abs((points - points.mean(axis=0)) / (np.std(points, axis=0) + 0.01))
validity = np.all(std_score <= np.quantile(std_score, 0.95, axis=0), axis=1)
Is there a method designed to solve this problem?
This seems like a typical clustering problem, and if the data looks as you suggested, KMeans from scikit-learn should do the trick. Let's look at how we can do this.
First, I generate a data sample that might look somewhat like your data.
import numpy as np
import matplotlib.pylab as plt
np.random.seed(1) # For reproducibility
cluster_1 = np.random.normal(loc = [1,1], scale = [0.2,0.2], size = (20,2))
cluster_2 = np.random.normal(loc = [2,1], scale = [0.4,0.4], size = (5,2))
plt.scatter(cluster_1[:,0], cluster_1[:,1])
plt.scatter(cluster_2[:,0], cluster_2[:,1])
plt.show()
points = np.vstack([cluster_1, cluster_2])
This is what the data will look like.
Next, we do the KMeans clustering.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 2).fit(points)
We choose n_clusters as 2, assuming that there are 2 clusters in the dataset. After finding these clusters, let's look at them.
plt.scatter(points[kmeans.labels_==0][:,0], points[kmeans.labels_==0][:,1], label='cluster_1')
plt.scatter(points[kmeans.labels_==1][:,0], points[kmeans.labels_==1][:,1], label ='cluster_2')
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], label = 'cluster_center')
plt.legend()
plt.show()
The resulting plot shows both clusters and their centers.
This should solve your problem, but there are some things to keep in mind:
It will not be perfect all the time.
It might be a problem if you don't have any outliers. This can be checked with silhouette scores.
It is difficult to know which cluster to discard. This can be decided by locating the cluster centers (the green points in the plot) or by picking the cluster with fewer points.
Endnote: You might lose some points, but you would automate the entire process. It depends on how much you want to trade off data saved against manual time saved.
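The two caveats about silhouette scores and picking the cluster to discard could be combined roughly as follows. This is only a sketch on synthetic data: the 0.5 silhouette threshold and the rule "drop the smaller cluster" are assumptions for illustration and would need checking against real annotations.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic sample: 20 correctly annotated points plus 5 far-away outliers.
rng = np.random.default_rng(1)
inliers = rng.normal(loc=[1, 1], scale=0.2, size=(20, 2))
outliers = rng.normal(loc=[5, 5], scale=0.2, size=(5, 2))
points = np.vstack([inliers, outliers])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# A high silhouette score suggests the two clusters are really separate;
# a low one suggests there may be no outlier group at all.
score = silhouette_score(points, kmeans.labels_)

SILHOUETTE_THRESHOLD = 0.5  # assumed cut-off, tune on your own data
if score > SILHOUETTE_THRESHOLD:
    sizes = np.bincount(kmeans.labels_)
    kept = points[kmeans.labels_ == sizes.argmax()]  # keep the larger cluster
else:
    kept = points  # no convincing split, keep everything

print(score, kept.shape)
```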

How to cluster *features* based on their correlations to each other with sklearn k-means clustering

I have a pandas dataframe with rows as records (patients) and 105 columns as features (properties of each patient).
I would like to cluster not the patients (the rows, as is customary) but the columns, so I can see which features are similar or correlated to which other features. I can already calculate the correlation of each feature with every other feature using df.corr(). But how can I cluster these into k=2,3,4... groups using sklearn.cluster.KMeans?
I tried KMeans(n_clusters=2).fit(df.T), which does cluster the features (because I took the transpose of the matrix) but only with a Euclidean distance function, not according to their correlations. I prefer to cluster the features according to correlations.
This should be very easy but I would appreciate your help.
KMeans is not very useful in this case, but you can use any clustering method that can work with a distance matrix, for example agglomerative clustering.
I'll use scipy; the sklearn version is simpler but not as powerful (e.g. in sklearn you cannot use the Ward method with a distance matrix).
from scipy.cluster import hierarchy
import scipy.spatial.distance as ssd
df = ... # your dataframe with many features
corr = df.corr() # we can consider this as affinity matrix
distances = 1 - corr.abs().values # pairwise distances
distArray = ssd.squareform(distances) # scipy converts matrix to 1d array
hier = hierarchy.linkage(distArray, method="ward") # you can use other methods
Read the docs to understand the structure of hier.
You can plot the dendrogram with
dend = hierarchy.dendrogram(hier, truncate_mode="level", p=30, color_threshold=1.5)
And finally, obtain cluster labels for your features
threshold = 1.5 # choose threshold using dendrogram or any other method (e.g. quantile or desired number of features)
cluster_labels = hierarchy.fcluster(hier, threshold, criterion="distance")
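For a self-contained illustration of the steps above, here is a small sketch on a made-up dataframe. The column names, the synthetic data, the average linkage and the choice of two clusters are assumptions for the example only. The diagonal is zeroed explicitly because floating-point rounding in 1 - |corr| can otherwise trip squareform's validity check.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

# Hypothetical data: two groups of strongly related features.
rng = np.random.default_rng(0)
base_a = rng.normal(size=200)
base_b = rng.normal(size=200)
df = pd.DataFrame({
    "a1": base_a,
    "a2": base_a + 0.1 * rng.normal(size=200),   # positively correlated with a1
    "b1": base_b,
    "b2": -base_b + 0.1 * rng.normal(size=200),  # negatively correlated with b1
})

# Distance between features: 0 for |corr| = 1, 1 for uncorrelated features.
distances = 1 - df.corr().abs().values
np.fill_diagonal(distances, 0.0)

condensed = squareform(distances, checks=False)
hier = hierarchy.linkage(condensed, method="average")

# Ask for (at most) two flat clusters of features.
cluster_labels = hierarchy.fcluster(hier, 2, criterion="maxclust")
feature_clusters = dict(zip(df.columns, cluster_labels))
print(feature_clusters)
```

Because the distance uses the absolute correlation, b1 and b2 end up together even though they are negatively correlated; dropping abs() would treat them as far apart instead.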
Create a new matrix from the correlations of all the features with df.corr(), then use this new matrix as the dataset for the k-means algorithm.
This will give you clusters of features that have similar correlation profiles.

sklearn specifying number of clusters

For the clustering algorithms in sklearn, is there a way to specify how many clusters you want the algorithm to find (instead of the algorithm finding its own number of clusters)? From my inputted data, I'm hoping for 2 clusters instead of the 3 it outputs for me.
If it helps, I'm using the MeanShift algorithm (but my question applies to all of them). Also, most tutorials seem to use make_blobs, but I'm using pandas's read_csv to upload my data instead if that changes anything.
This is the beginning part of my code:
import numpy as np
import pandas as pd
from sklearn.cluster import MeanShift
df = pd.read_csv(filename, header = 0)
original_headers = list(df.columns.values)
df = df._get_numeric_data()
data = df.values
ms = MeanShift()
ms.fit(data)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
n_clusters_ = len(np.unique(labels))
print("Number of estimated clusters:", n_clusters_)
As some users said above, it is not possible to set the desired number of clusters in the MeanShift algorithm.
When we talk about clustering, there are a lot of models that can be employed depending on your problem. Density-based models, like MeanShift and DBSCAN, try to find areas of higher density than the remainder of the data set, so the number of clusters is determined by the data itself.
On the other hand, centroid-based methods like K-Means start their iterations from the number of centroids passed as a parameter.
The following link shows many of the clustering algorithms in sklearn. Try to figure out which one best suits your problem.
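To make the contrast concrete, here is a minimal sketch on synthetic blobs (the data and parameters are assumptions for illustration). MeanShift reports however many clusters it finds, whereas KMeans returns exactly the number requested through n_clusters.

```python
import numpy as np
from sklearn.cluster import KMeans, MeanShift

# Synthetic data: three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
data = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Density-based: the number of clusters is an output, not an input.
ms_labels = MeanShift().fit(data).labels_
n_meanshift = len(np.unique(ms_labels))

# Centroid-based: the number of clusters is fixed up front.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
n_kmeans = len(np.unique(km_labels))

print("MeanShift found:", n_meanshift)
print("KMeans returned:", n_kmeans)
```

With MeanShift, the bandwidth parameter influences how many clusters come out, but only indirectly; it is not a way to request an exact count.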
http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
References:
https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
https://en.wikipedia.org/wiki/Cluster_analysis

Alternative to scipy.cluster.hierarchy.cut_tree()

I was doing an agglomerative hierarchical clustering experiment in Python 3 and I found scipy.cluster.hierarchy.cut_tree() is not returning the requested number of clusters for some input linkage matrices. So, by now I know there is a bug in the cut_tree() function (as described here).
However, I need to be able to get a flat clustering with an assignment of k different labels to my datapoints. Do you know the algorithm to get a flat clustering with k labels from an arbitrary input linkage matrix Z? My question boils down to: how can I compute what cut_tree() is computing from scratch with no bugs?
You can test your code with this dataset.
import numpy as np
from scipy.cluster.hierarchy import linkage, is_valid_linkage
from scipy.spatial.distance import pdist
## Load dataset
X = np.load("dataset.npy")
## Hierarchical clustering
dists = pdist(X)
Z = linkage(dists, method='centroid', metric='euclidean')
print(is_valid_linkage(Z))
## Now let's say we want the flat cluster assignment with 10 clusters.
# If cut_tree() was working we would do
from scipy.cluster.hierarchy import cut_tree
cut = cut_tree(Z, 10)
Sidenote: An alternative approach could maybe be using rpy2's cutree() as a substitute for scipy's cut_tree(), but I never used it. What do you think?
One way to obtain k flat clusters is to use scipy.cluster.hierarchy.fcluster with criterion='maxclust':
from scipy.cluster.hierarchy import fcluster
clust = fcluster(Z, k, criterion='maxclust')
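A runnable sketch of this approach on synthetic data follows; the blobs, the average linkage and k = 3 are assumptions for the example, since the original dataset.npy is not reproduced here. Note that criterion='maxclust' asks for no more than k flat clusters, so it is worth checking the number actually returned, especially with non-monotonic linkages such as 'centroid'.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Synthetic stand-in for the dataset: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(15, 2)) for c in centers])

Z = linkage(pdist(X), method="average")

k = 3
clust = fcluster(Z, k, criterion="maxclust")  # labels start at 1

# Verify how many flat clusters were actually produced.
n_found = len(np.unique(clust))
print(n_found)
```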

k-means using signature matrix generated from minhash

I have used minhash on documents and their shingles to generate a signature matrix from these documents. I have verified that the signature matrices are good, since comparing Jaccard distances of known similar documents (say, two articles about the same sports team or two articles about the same world event) gives correct readings.
My question is: does it make sense to use this signature matrix to perform k-means clustering?
I've tried using the signature vectors of documents and calculating the Euclidean distance between these vectors inside the iterative k-means algorithm, and I always get nonsense for my clusters. I know there should be two clusters (my data set is a few thousand articles about either sports or business), and in the end my two clusters are always just random. I'm convinced that the randomness of hashing words into integers is going to skew the distance function every time and overpower similar hash values in two signature matrices.
[Edited to highlight the question]
TL;DR
Short answer: No, it doesn't make sense to use the signature matrix for K-means clustering. At least, not without significant manipulation.
Some explanations
I'm coming at this after a few days of figuring out how to do the same thing (text clustering) myself. I might be wrong, but my perception is that you're making the same mistake I was: using MinHash to build an [n_samples x n_perms] matrix, then using this as a features matrix X on which you run k-means.
I'm guessing you're doing something like:
# THIS CODE IS AN EXAMPLE OF WRONG! DON'T IMPLEMENT!
import numpy as np
from datasketch import MinHash  # assuming the datasketch library
from nltk import ngrams  # assuming nltk's ngrams helper
from sklearn.cluster import KMeans
# Get your data.
data = get_your_list_of_strings_to_cluster()
n_samples = len(data)
# Minhash all the strings
n_perms = 128
minhash_values = np.zeros((n_samples, n_perms), dtype='uint64')
for index, string in enumerate(data):
    minhash = MinHash(num_perm=n_perms)
    for gram in ngrams(string, 3):
        minhash.update("".join(gram).encode('utf-8'))
    minhash_values[index, :] = minhash.hashvalues
# Compute clusters
clusterer = KMeans(n_clusters=8)
clusters = clusterer.fit_predict(minhash_values)
This will behave horribly because of a fatal flaw: the minhash_values array is not a feature matrix. Each row is basically a list of features (hashes) which appear in that sample of text... but they're not column-aligned, so features are scattered into the wrong dimensions.
To turn that into a feature matrix, you'd have to look at all the unique hashes in minhash_values then create a matrix which is [n_samples x n_unique_hashes], (n_unique_hashes is the number of unique features found) setting it to 1 where the text sample contains that feature, 0 elsewhere. Typically this matrix would be large and sparse. You could then cluster on that.
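The construction described above could look roughly like this with plain NumPy. The tiny minhash_values array is made up for illustration, and a dense array is used only for readability; for real data a scipy.sparse matrix would be the more realistic container.

```python
import numpy as np

# Made-up signature matrix: 3 samples, 3 hash values each.
minhash_values = np.array([
    [11, 42, 97],
    [42, 11, 63],
    [500, 97, 11],
], dtype=np.uint64)

# Map every hash value to a column index in the new feature matrix.
unique_hashes, inverse = np.unique(minhash_values, return_inverse=True)
inverse = inverse.reshape(minhash_values.shape)

n_samples, n_perms = minhash_values.shape
features = np.zeros((n_samples, unique_hashes.size), dtype=np.uint8)

# Set a 1 wherever a sample contains a given hash.
rows = np.repeat(np.arange(n_samples), n_perms)
features[rows, inverse.ravel()] = 1

print(unique_hashes)
print(features)
```

Each column now refers to the same hash for every sample, which is the property the raw signature rows were said to lack.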
Alternative way of text clustering
What an unbelievable hassle though! Fortunately, scikit-learn is there to help. It provides some very easy to use and scalable vectorisers.
So your problem becomes easily solved:
# Imports
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import KMeans
# Get your data
data = get_your_list_of_strings_to_cluster()
# Get your feature matrix
text_features = HashingVectorizer(analyzer="word").fit_transform(data)
# Compute clusters
clusterer = KMeans(n_clusters=2)
clusters = clusterer.fit_predict(text_features)
And there you go. From there:
Fine-tune your vectoriser (try TfidfVectorizer too, tweak the input params, etc.),
Try other clusterers (e.g. I find HDBSCAN miles better than kmeans: quicker, more robust, more accurate, less tuning).
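As a small sketch of the first suggestion, the same pipeline can be run with TfidfVectorizer in place of HashingVectorizer. The four example sentences and the parameter values are invented for illustration, and no particular cluster assignment is implied for such a tiny corpus.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented mini-corpus: two sports and two business snippets.
data = [
    "The team won the football match after a late goal",
    "The striker scored twice in the football final",
    "The company reported higher quarterly profits and revenue",
    "Shares rose after the company announced record revenue",
]

# TF-IDF weights down words that appear in most documents.
text_features = TfidfVectorizer(stop_words="english").fit_transform(data)

clusterer = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = clusterer.fit_predict(text_features)
print(clusters)
```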
Hope this helps.
Tom
