I'm clustering data with DBSCAN in order to remove outliers. The computation is very memory-consuming because the implementation of DBSCAN in scikit-learn can't handle almost 1 GB of data. The problem was already mentioned here.
The bottleneck of the following code appears to be the distance-matrix calculation, which is very memory-consuming (size of the matrix: 10 million x 10 million). Is there a way to optimize the computation of DBSCAN?
My brief research suggests that the matrix should be reduced to a sparse matrix in some way to make the computation feasible.
My ideas for solving this problem:
1. create and calculate a sparse matrix
2. calculate parts of the matrix, save them to files, and merge them later
3. perform DBSCAN on small subsets of the data and merge the results
4. switch to Java and use the ELKI tool
Code:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import euclidean_distances
# sample data
speed = np.random.uniform(0,25,1000000)
power = np.random.uniform(0,3000,1000000)
# create a dataframe
data_dict = {'speed': speed,
             'power': power}
df = pd.DataFrame(data_dict)
# convert the dataframe to a NumPy array
X = df.values.astype("float64", copy=False)
# normalize data
X = StandardScaler().fit_transform(X)
# precompute the full matrix of pairwise distances (this is the memory bottleneck)
dist_matrix = euclidean_distances(X, X)
# perform DBSCAN clustering
db = DBSCAN(eps=0.1, min_samples=60, metric="precomputed", n_jobs=-1).fit(dist_matrix)
Ideas 1 to 3 will not work.
Your data is dense. It isn't "mostly 0s", so sparse formats will actually need much more memory. The exact thresholds vary, but as a rule of thumb you'll need at least 90% zeros for sparse formats to become effective.
DBSCAN does not need a distance matrix; it only needs the eps-neighborhood of each point, which an index structure can answer.
Working on parts and then merging isn't that easy (there is GriDBSCAN, which does this for Euclidean distance). You cannot just take random partitions and merge them later.
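A minimal sketch of the practical alternative (assuming the standardized X from the question; eps and min_samples are copied from there and will likely still need tuning): let scikit-learn answer the eps-range queries from a tree index instead of a precomputed matrix.
from sklearn.cluster import DBSCAN
# No precomputed matrix: the eps-neighbourhoods are answered by a ball tree,
# so the full pairwise distance matrix is never materialised.
db = DBSCAN(eps=0.1, min_samples=60, metric="euclidean",
            algorithm="ball_tree", n_jobs=-1).fit(X)
labels = db.labels_  # -1 marks noise points (the outliers you want to remove)
This can still use a fair amount of memory if eps is large (all neighbourhoods are stored at once), but it avoids the N x N matrix entirely.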
Related
I labeled lots of object images using the Google Vision API. Using those labels (list in pickle here), I created a label co-occurrence matrix (download as a numpy array here). The size of the matrix is 2195 x 2195.
Loading the data:
import pickle
import numpy as np
with open('labels.pkl', 'rb') as f:
    labels = pickle.load(f)
cooccurrence = np.load('cooccurrence.npy')
I would like to use cluster analysis to define a reasonable number of clusters (defined as lists of Vision labels) that would represent some objects (e.g. cars, shoes, books, ...). I do not know what the right number of clusters is.
I tried the hierarchical clustering algorithm available in scikit-learn:
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 1000)
#creating a non-symmetrical "similarity" matrix:
occurrences = cooccurrence.diagonal().copy()
similarities = cooccurrence / occurrences[:,None]
#clustering:
from sklearn.cluster import AgglomerativeClustering
clusters = AgglomerativeClustering(n_clusters=200, affinity='euclidean', linkage='ward').fit_predict(similarities)
#results in pandas:
df_clusters = pd.DataFrame({'cluster': clusters.tolist(), 'label': labels})
df_clusters_grouped = df_clusters.groupby(['cluster']).agg({'label': [len, list]})
df_clusters_grouped.columns = [' '.join(col).strip() for col in df_clusters_grouped.columns.values]
df_clusters_grouped.rename(columns = {'label len': 'cluster_size', 'label list': 'cluster_labels'}, inplace=True)
df_clusters_grouped.sort_values(by=['cluster_size'], ascending=False)
Like this, I was able to create 200 clusters where one can look like:
["Racket", "Racquet sport", "Tennis racket", "Rackets", "Tennis", "Racketlon", "Tennis racket accessory", "Strings"]
This somehow works, but I would rather use a soft clustering method that can assign one label to multiple clusters (for instance, "leather" might make sense for both shoes and wallets). Also, I had to define the number of clusters (200 in my example code), which is something I would rather get as a result (if possible).
I also played with hdbscan, k-clique and Gaussian mixture models, but I did not come up with any better output.
Clustering methods such as AgglomerativeClustering in sklearn require a data matrix as input. With metric="precomputed" you can also use a distance matrix (but not for k-means and Gaussian mixture modeling; these do need coordinate data).
You, however, have a co-occurrence or similarity matrix. These values have the opposite meaning of distances, so you'll have to identify an appropriate transformation (for example, occurrences - cooccurrences). Treating the co-occurrence matrix as a data matrix (and then using Euclidean distance, which is what you do) works to some extent but has very weird semantics and is not recommended.
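As a rough illustration of such a transformation (my sketch, not the only valid choice): symmetrise the similarity matrix, turn it into a distance matrix, and feed that to agglomerative clustering with a precomputed metric. The 1 - similarity transform and the average linkage here are assumptions.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
# Symmetrise the row-normalised similarity matrix from the question.
sym = (similarities + similarities.T) / 2
distances = 1.0 - sym  # high similarity -> small distance
np.fill_diagonal(distances, 0.0)
# Ward needs coordinate data, so use e.g. average linkage with the precomputed
# distances (in newer scikit-learn the parameter is metric= instead of affinity=).
clusters = AgglomerativeClustering(n_clusters=200, affinity='precomputed',
                                   linkage='average').fit_predict(distances)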
My dataset has 2000 attributes and 200 samples. I need to reduce its dimensionality. To do this, I am trying to use the Fourier transform for dimensionality reduction. The Fourier transform returns the discrete Fourier transform when I feed it my data, but I do not know how to use the result for dimensionality reduction.
from scipy.fftpack import fft
import pandas as pd
price = pd.read_csv(priceFile(), sep=",")
transformed = fft(price)
Can you please help me?
The Fourier transform is most suited if your samples are each a time series. If they are, you can extract frequency-domain features for each sample from transformed; the linked reference lists common time- and frequency-domain features you can consider.
Let's say you have a Pandas data frame with 2000 attributes and 200 samples, as you mentioned:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size=(200, 2000)))
To reduce the dimensionality using scipy, you can generate a new array with the transformed values by first setting the number of dimensions (n_dimensions) that you want and then calling the scipy function (fft).
First we import the function:
from scipy.fftpack import fft
Then we set the number of dimensions; in this case we will use 1 dimension:
n_dimensions = 1
Then we call the function, passing our data frame and the number of dimensions:
transformed_data = fft(df,n=n_dimensions)
Then, if we want to work with real numbers, we can take the real part of the transformed array:
transformed_data = transformed_data.real
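If you want k coefficients per sample rather than a single value, one possible variant (my own sketch, not part of the answer above) is to compute the full transform along the attribute axis and keep the magnitudes of the first k coefficients:
import numpy as np
import pandas as pd
from scipy.fftpack import fft
df = pd.DataFrame(np.random.randint(0, 100, size=(200, 2000)))
k = 10  # hypothetical number of coefficients to keep per sample
full_transform = fft(df.values, axis=1)  # FFT along the 2000 attributes of each sample
reduced = np.abs(full_transform[:, :k])  # keep magnitudes of the first k coefficients
print(reduced.shape)  # (200, 10)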
I have a pandas dataframe with rows as records (patients) and 105 columns as features (properties of each patient).
I would like to cluster, not the patients (the rows, as is customary), but the columns, so I can see which features are similar or correlated to which other features. I can already calculate the correlation of each feature with every other feature using df.corr(). But how can I cluster these into k=2,3,4,... groups using sklearn.cluster.KMeans?
I tried KMeans(n_clusters=2).fit(df.T), which does cluster the features (because I took the transpose of the matrix), but only with a Euclidean distance function, not according to their correlations. I would prefer to cluster the features according to their correlations.
This should be very easy but I would appreciate your help.
KMeans is not very useful in this case, but you can use any clustering method that works with a distance matrix, for example agglomerative clustering.
I'll use scipy; the sklearn version is simpler but not as powerful (e.g. in sklearn you cannot use the Ward method with a distance matrix).
from scipy.cluster import hierarchy
import scipy.spatial.distance as ssd
df = ... # your dataframe with many features
corr = df.corr() # we can consider this as an affinity matrix
distances = 1 - corr.abs().values # pairwise distances
distArray = ssd.squareform(distances) # scipy expects a condensed 1d distance array
hier = hierarchy.linkage(distArray, method="ward") # you can use other methods
Read the docs to understand the structure of hier.
You can plot a dendrogram with
dend = hierarchy.dendrogram(hier, truncate_mode="level", p=30, color_threshold=1.5)
And finally, obtain cluster labels for your features
threshold = 1.5 # choose the threshold using the dendrogram or any other method (e.g. a quantile or the desired number of clusters)
cluster_labels = hierarchy.fcluster(hier, threshold, criterion="distance")
Create a new matrix by taking the correlations of all the features, df.corr(), and then use this new matrix as your dataset for the k-means algorithm.
This will give you clusters of features which have similar correlations.
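A minimal sketch of that idea (assuming df is the patients-by-features dataframe from the question; k=3 is only an example):
from sklearn.cluster import KMeans
corr = df.corr()  # 105 x 105 matrix of feature-feature correlations
kmeans = KMeans(n_clusters=3, random_state=0).fit(corr)
feature_clusters = dict(zip(corr.columns, kmeans.labels_))  # feature name -> cluster id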
I was doing an agglomerative hierarchical clustering experiment in Python 3 and I found that scipy.cluster.hierarchy.cut_tree() does not return the requested number of clusters for some input linkage matrices. So, by now I know there is a bug in the cut_tree() function (as described here).
However, I need to be able to get a flat clustering with an assignment of k different labels to my datapoints. Do you know the algorithm to get a flat clustering with k labels from an arbitrary input linkage matrix Z? My question boils down to: how can I compute what cut_tree() is computing from scratch with no bugs?
You can test your code with this dataset.
import numpy as np
from scipy.cluster.hierarchy import linkage, is_valid_linkage
from scipy.spatial.distance import pdist
## Load dataset
X = np.load("dataset.npy")
## Hierarchical clustering
dists = pdist(X)
Z = linkage(dists, method='centroid', metric='euclidean')
print(is_valid_linkage(Z))
## Now let's say we want the flat cluster assignment with 10 clusters.
# If cut_tree() was working we would do
from scipy.cluster.hierarchy import cut_tree
cut = cut_tree(Z, 10)
Sidenote: an alternative approach could be to use rpy2's cutree() as a substitute for scipy's cut_tree(), but I have never used it. What do you think?
One way to obtain k flat clusters is to use scipy.cluster.hierarchy.fcluster with criterion='maxclust':
from scipy.cluster.hierarchy import fcluster
clust = fcluster(Z, k, criterion='maxclust')
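For the example in the question (k = 10), a quick sanity check; note that fcluster labels start at 1 and criterion='maxclust' produces at most k clusters:
import numpy as np
k = 10
clust = fcluster(Z, k, criterion='maxclust')
print(np.unique(clust).size)  # number of flat clusters actually produced (<= k)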
I have used MinHash on documents and their shingles to generate a signature matrix from these documents. I have verified that the signature matrices are good: comparing Jaccard distances of known similar documents (say, two articles about the same sports team, or two articles about the same world event) gives correct readings.
My question is: does it make sense to use this signature matrix to perform k-means clustering?
I've tried using the signature vectors of the documents and calculating the Euclidean distance of these vectors inside the iterative k-means algorithm, and I always get nonsense for my clusters. I know there should be two clusters (my data set is a few thousand articles about either sports or business), and in the end my two clusters are always just random. I'm convinced that the randomness of hashing words into integers skews the distance function every time and overpowers similar hash values in two signature matrices.
[Edited to highlight the question]
TL;DR
Short answer: No, it doesn't make sense to use the signature matrix for K-means clustering. At least, not without significant manipulation.
Some explanations
I'm coming at this after a few days of figuring out how to do the same thing (text clustering) myself. I might be wrong, but my perception is that you're making the same mistake I was: using MinHash to build an [n_samples x n_perms] matrix and then using it as a feature matrix X on which you run k-means.
I'm guessing you're doing something like:
# THIS CODE IS AN EXAMPLE OF WHAT NOT TO DO! DON'T IMPLEMENT IT!
import numpy as np
from datasketch import MinHash   # assuming the datasketch MinHash implementation
from nltk.util import ngrams     # assuming nltk's ngrams helper
from sklearn.cluster import KMeans
# Get your data.
data = get_your_list_of_strings_to_cluster()
n_samples = len(data)
# Minhash all the strings
n_perms = 128
minhash_values = np.zeros((n_samples, n_perms), dtype='uint64')
for index, string in enumerate(data):
    minhash = MinHash(num_perm=n_perms)
    for gram in ngrams(string, 3):
        minhash.update("".join(gram).encode('utf-8'))
    minhash_values[index, :] = minhash.hashvalues
# Compute clusters
clusterer = KMeans(n_clusters=8)
clusters = clusterer.fit_predict(minhash_values)
This will behave horribly because of a fatal flaw: the minhash_values array is not a feature matrix. Each row is basically a list of features (hashes) which appear in that sample of text, but they're not column-aligned, so features are scattered into the wrong dimensions.
To turn that into a feature matrix, you'd have to look at all the unique hashes in minhash_values and then create a matrix of size [n_samples x n_unique_hashes] (n_unique_hashes is the number of unique features found), setting each entry to 1 where the text sample contains that feature and 0 elsewhere. Typically this matrix would be large and sparse. You could then cluster on that, as sketched below.
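A rough illustration of that construction (my sketch, reusing minhash_values and n_samples from the snippet above); in practice you would keep the matrix sparse, for example with scipy:
import numpy as np
from scipy.sparse import lil_matrix
# Map every unique hash value to a column index.
unique_hashes = np.unique(minhash_values)
hash_to_col = {h: i for i, h in enumerate(unique_hashes)}
# Binary indicator matrix: 1 if the sample produced that hash value, 0 otherwise.
X = lil_matrix((n_samples, len(unique_hashes)), dtype=np.int8)
for row, hashes in enumerate(minhash_values):
    for h in hashes:
        X[row, hash_to_col[h]] = 1
X = X.tocsr()  # convert to CSR before handing it to a clusterer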
Alternative way of text clustering
What an unbelievable hassle though! Fortunately, scikit-learn is there to help. It provides some very easy-to-use and scalable vectorisers, such as HashingVectorizer and TfidfVectorizer.
So your problem becomes easily solved:
# Imports
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import KMeans
# Get your data
data = get_your_list_of_strings_to_cluster()
# Get your feature matrix
text_features = HashingVectorizer(analyzer="word").fit_transform(data)
# Compute clusters
clusterer = KMeans(n_clusters=2)
clusters = clusterer.fit_predict(text_features)
And there you go. From there:
Fine-tune your vectoriser (try TfidfVectorizer too, tweak the input params, etc.),
Try other clusterers (for example, I find HDBSCAN miles better than k-means: quicker, more robust, more accurate, less tuning); a rough sketch follows below.
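If you want to try the HDBSCAN route, here is a rough sketch (my own, assuming the hdbscan package is installed; the TruncatedSVD step and all parameter values are illustrative, not tuned):
import hdbscan
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import HashingVectorizer
data = get_your_list_of_strings_to_cluster()
text_features = HashingVectorizer(analyzer="word").fit_transform(data)
# Reduce the sparse hashed features to a dense, low-dimensional space first.
reduced = TruncatedSVD(n_components=50).fit_transform(text_features)
clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
labels = clusterer.fit_predict(reduced)  # -1 marks noise; no n_clusters needed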
Hope this helps.
Tom