Optimize Hamming Distance Python - python

I have around 1M binary numpy arrays and I need to compute the Hamming distance between them to find the k nearest neighbours. The fastest method I have found is cdist, which returns a float matrix of distances.
Since I don't have enough memory to hold a 1M x 1M float matrix, I'm doing it one element at a time, like this:
from scipy.spatial import distance
Hamming_Distance = distance.cdist(array1, all_array, 'hamming')
The problem is that each Hamming_Distance call takes 2-3 s, so for 1M documents it takes an eternity (and I need to repeat it for different k).
Is there any faster way to do it?
I'm thinking about multiprocessing or writing it in C, but I have some trouble understanding how multiprocessing works in Python and I don't know how to mix C code with Python code.

If you want to compute the k-nearest neighbors, it may not be necessary to compute all n^2 pairs of distances. Instead, you can use a Kd tree or a ball tree (both are data structures for efficiently querying relations between a set of points).
SciPy has a package called scipy.spatial.kdtree, but it does not currently support Hamming distance as a metric between points. The wonderful folks at scikit-learn (aka sklearn), however, do have a ball tree implementation that supports Hamming distance. Here's a small example using sklearn's BallTree.
from sklearn.neighbors import BallTree
import numpy as np
# Generate random binary data.
data = np.random.randint(0, 2, size=(10, 10))
# Build the ball tree.
ballt = BallTree(data, leaf_size=30, metric='hamming')
distances, neighbors = ballt.query(data, k=3)
print(neighbors)  # Row n holds the nth vector's k closest neighbors.
print(distances)  # Same idea, but the Hamming distances to those neighbors.
Now for the big caveat: for high-dimensional vectors, KDTree and BallTree become comparable to the brute-force algorithm. I'm a bit unclear on the dimensionality of your vectors, but hopefully the snippet above gives you some ideas/direction.
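If you do end up needing brute force anyway (for example because your vectors are high dimensional), here is a hedged sketch of a chunked approach: it reuses cdist from your question but keeps only the k smallest distances per row via np.argpartition, so the full 1M x 1M matrix is never materialised. The function name and chunk size are just illustrative.
import numpy as np
from scipy.spatial.distance import cdist
def knn_hamming_bruteforce(all_array, k=5, chunk=1000):
    # Process a block of rows at a time so only a (chunk x n) float matrix exists in memory.
    n = all_array.shape[0]
    neighbors = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, chunk):
        block = all_array[start:start + chunk]
        d = cdist(block, all_array, 'hamming')   # distances for this block only
        # Indices of the k smallest distances per row (unsorted within the top k;
        # this includes each point itself, so use k + 1 and drop it if needed).
        neighbors[start:start + chunk] = np.argpartition(d, k, axis=1)[:, :k]
    return neighbors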

Related

Precomputed distance matrix in DBSCAN

Reading around, I found that it is possible to pass a precomputed distance matrix into scikit-learn's DBSCAN. Unfortunately, I don't know how to pass it in for the calculation.
Say I have a 1D array with 100 elements containing just the names of the nodes, and a 2D matrix, 100x100, with the distance between each pair of elements (in the same order).
I know I have to call it:
db = DBSCAN(eps=2, min_samples=5, metric="precomputed")
This sets a distance between nodes of 2 and a minimum of 5 nodes per cluster, and uses "precomputed" to indicate that the 2D matrix should be used. But how do I pass the matrix in for the calculation?
The same question applies when using the RAPIDS cuML DBSCAN function (GPU accelerated).
documentation:
class sklearn.cluster.DBSCAN(eps=0.5, *, min_samples=5, metric='euclidean',
metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None)
[...]
[...]
metric : str or callable, default='euclidean'
The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances for its metric parameter. If metric is "precomputed", X is assumed to be a distance matrix and must be square. X may be a sparse graph, in which case only "nonzero" elements may be considered neighbors for DBSCAN.
[...]
So the way you normally call this is:
from sklearn.cluster import DBSCAN
clustering = DBSCAN()
clustering.fit(X)
If you have a distance matrix, you do:
from sklearn.cluster import DBSCAN
clustering = DBSCAN(metric='precomputed')
clustering.fit(distance_matrix)
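For example (a minimal sketch; the data, eps and min_samples values are just illustrative), you can build the square matrix with sklearn.metrics.pairwise_distances and hand it straight to fit_predict:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances
X = np.random.rand(100, 3)                                   # 100 nodes with some features
distance_matrix = pairwise_distances(X, metric='euclidean')  # 100x100, square
clustering = DBSCAN(eps=2, min_samples=5, metric='precomputed')
labels = clustering.fit_predict(distance_matrix)             # one cluster label per node, -1 = noise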

Is there a way to find nearest neighbors with BallTree or KDTree using cosine similarity?

I have very sparse and huge rating data for which I need to find the top-k neighbors of each session. I need to compare approximate and exact nearest-neighbor algorithms, but since the data is very big and sparse, computing the exact method with brute force takes days. I want to use k-d trees or ball trees, but they do not support cosine distance. Is there a way to convert other distance measures to cosine similarity mathematically, or is there any other way to compute the exact neighborhood?
Try normalizing your vectors to unit length and using the Euclidean metric. For unit vectors, squared Euclidean distance is a monotonic function of cosine distance (||u - v||^2 = 2(1 - cos(u, v))), so both measures return the same nearest neighbors.
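A minimal sketch of that idea (the shapes below are illustrative, and it assumes you can afford a dense matrix, since sklearn's KDTree and BallTree do not accept sparse input):
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.neighbors import KDTree
X = np.random.rand(1000, 32)           # e.g. 1000 sessions x 32 features
X_unit = normalize(X, norm='l2')       # scale every row to unit length
tree = KDTree(X_unit)                  # Euclidean metric by default
dist, idx = tree.query(X_unit, k=5)    # exact k nearest neighbors per row
cosine_dist = dist**2 / 2              # recover cosine distance from ||u-v||^2 = 2(1 - cos)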

k-means using signature matrix generated from minhash

I have used MinHash on documents and their shingles to generate a signature matrix for these documents. I have verified that the signature matrices are good: comparing the Jaccard distances of known-similar documents (say, two articles about the same sports team or two articles about the same world event) gives correct readings.
My question is: does it make sense to use this signature matrix to perform k-means clustering?
I've tried using the signature vectors of the documents and calculating the Euclidean distance between these vectors inside the iterative k-means algorithm, and I always get nonsense for my clusters. I know there should be two clusters (my data set is a few thousand articles about either sports or business), yet in the end my two clusters are always just random. I'm convinced that the randomness of hashing words into integers skews the distance function every time and overpowers similar hash values in two signature matrices.
[Edited to highlight the question]
TL;DR
Short answer: No, it doesn't make sense to use the signature matrix for K-means clustering. At least, not without significant manipulation.
Some explanations
I'm coming at this after a few days of figuring out how to do the same thing (text clustering) myself. I might be wrong, but my perception is that you're making the same mistake I was: using MinHash to build an [n_samples x n_perms] matrix, then using this as a features matrix X on which you run k-means.
I'm guessing you're doing something like:
# THIS CODE IS AN EXAMPLE OF WHAT NOT TO DO! DON'T IMPLEMENT!
import numpy as np
from datasketch import MinHash      # MinHash implementation
from nltk.util import ngrams        # character n-grams
from sklearn.cluster import KMeans
# Get your data.
data = get_your_list_of_strings_to_cluster()
n_samples = len(data)
# MinHash all the strings.
n_perms = 128
minhash_values = np.zeros((n_samples, n_perms), dtype='uint64')
for index, string in enumerate(data):
    minhash = MinHash(num_perm=n_perms)
    for gram in ngrams(string, 3):
        minhash.update("".join(gram).encode('utf-8'))
    minhash_values[index, :] = minhash.hashvalues
# Compute clusters.
clusterer = KMeans(n_clusters=8)
clusters = clusterer.fit_predict(minhash_values)
This will behave horribly because of a fatal flaw: the minhash_values array is not a feature matrix. Each row is basically a list of features (hashes) that appear in that sample of text... but they are not column-aligned, so features are scattered into the wrong dimensions.
To turn that into a feature matrix, you'd have to look at all the unique hashes in minhash_values, then create a matrix of shape [n_samples x n_unique_hashes] (where n_unique_hashes is the number of unique features found), setting it to 1 where the text sample contains that feature and 0 elsewhere. Typically this matrix would be large and sparse. You could then cluster on that.
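A minimal sketch of that conversion, assuming the minhash_values and n_samples variables from the snippet above (kept sparse, since this matrix is usually huge):
import numpy as np
from scipy.sparse import lil_matrix
unique_hashes = np.unique(minhash_values)                   # every distinct hash value seen
hash_to_col = {h: col for col, h in enumerate(unique_hashes)}
features = lil_matrix((n_samples, len(unique_hashes)), dtype=np.uint8)
for row, sample_hashes in enumerate(minhash_values):
    for h in sample_hashes:
        features[row, hash_to_col[h]] = 1                   # 1 if this sample contains that hash
# features.tocsr() is now a proper, column-aligned (and very sparse) feature matrix.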
Alternative way of text clustering
What an unbelievable hassle, though! Fortunately, scikit-learn is here to help: it provides some very easy-to-use and scalable vectorisers, such as HashingVectorizer and TfidfVectorizer.
So your problem becomes easily solved:
# Imports
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import KMeans
# Get your data
data = get_your_list_of_strings_to_cluster()
# Get your feature matrix
text_features = HashingVectorizer(analyzer="word").fit_transform(data)
# Compute clusters
clusterer = KMeans(n_clusters=2)
clusters = clusterer.fit_predict(text_features)
And there you go. From there:
Fine-tune your vectoriser (try TfidfVectorizer too, tweak the input params, etc.),
Try other clusterers (for example, I find HDBSCAN miles better than k-means: quicker, more robust, more accurate, less tuning).
Hope this helps.
Tom

Pandas Matrix to Distance Matrix as fast as possible

I want to calculate an NxN similarity matrix using sklearn's cosine distance. My problem is that my matrix is very large; it has about 1000 entries. My current approach is very slow and I need a real speed-up. Can anybody help me speed up the code?
for i in similarity_matrix.columns:
    for j in similarity_matrix.columns:
        if i == j:
            similarity_matrix.ix[i, j] = 0
        else:
            similarity_matrix.ix[i, j] = cosine(documents[int(i)], documents[int(j)])
Bonus task: in addition, I would like to use a weighted cosine formula, but it does not seem to be implemented in sklearn. Is that true?
Using for-loops is not the ideal solution here. I would recommend falling back on SciPy's pdist function. I assume you don't mean your matrix has 1000 entries but rather is 1000x1000; SciPy can handle that easily.
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
res = pdist(documents.T, 'cosine')
distances = 1 - pd.DataFrame(squareform(res), index=documents.columns, columns=documents.columns)
I have trouble understanding what your weight vector looks like. Is it a constant value? pdist also accepts a custom function, so you can, for example, compute a weighted cosine distance directly with numpy (which is also really fast), with norm from numpy.linalg; note the leading 1 - so that pdist gets a distance rather than a similarity:
pdist(X, lambda u, v: 1 - np.dot(np.multiply(u, v), weightvec) / (norm(np.multiply(u, weightvec)) * norm(np.multiply(v, weightvec))))

Wrap-around when calculating distance for k-means

I'm trying to do a k-means clustering of a dataset using sklearn. The problem is that one of the dimensions is hour-of-day: a number from 0-23, so the distance algorithm thinks that 0 is very far from 23, because in absolute terms it is. In reality, and for my purposes, hour 0 is very close to hour 23. Is there a way to make the distance algorithm do some form of wrap-around so that it computes the more 'real' time difference?
I'm doing something simple, similar to the following:
import numpy as np
from sklearn.cluster import KMeans
clusters = KMeans(n_clusters=2)
data = np.vstack(data)
fit = clusters.fit(data)
classes = fit.predict(data)
Data elements look something like [22, 418, 192], where the first element is the hour.
Any ideas?
Even though @elyase's answer is accepted, I think it is not the correct approach.
Yes, to use such a distance you have to refine your distance measure and therefore use a different library. But what is more important: the concept of the mean used in k-means does not suit a cyclic dimension. Let's consider the following example:
# current cluster X, based on centroid position Xc=24
x1 = 1
x2 = 24
# current cluster Y, based on centroid position Yc=10
y1 = 12
y2 = 13
Computing the simple arithmetic mean will place the centroids at Xc=12.5, Yc=12.5, which from the point of view of the cyclic measure is incorrect: it should be Xc=0.5, Yc=12.5. As you can see, assignment based on the cyclic distance measure is not "compatible" with the simple mean operation, and leads to bizarre results:
Simple k-means will result in the clusters {x1,y1}, {x2,y2}
Simple k-means + cyclic distance measure results in the degenerate super-cluster {x1,x2,y1,y2}
The correct clustering would be {x1,x2}, {y1,y2}
Solving this problem requires checking, for each point, whether it is better to use its plain value in the "simple average" or to represent it shifted by the period (e.g. x' = x - 24). Unfortunately, given n points this gives 2^n possibilities.
This looks like a use case for kernelized k-means, where you actually cluster in the abstract feature space (in your case, a "tube" rolled around the time dimension) induced by a kernel (a "similarity measure" that is the inner product of some vector space).
Details of the kernel k-means are given here
Why k-means doesn't work with arbitrary distances
K-means is not a distance-based algorithm.
K-means minimizes the Within-Cluster-Sum-of-Squares, which is a kind of variance (it's roughly the weighted average variance of all clusters, where each object and dimension is given the same weight).
In order for Lloyd's algorithm to converge you need to have both steps optimize the same function:
the reassignment step
the centroid update step
Now the "mean" function is a least-squares estimator. I.e. choosing the mean in step 2 is optimal for the WCSS objective. Assigning objects by least-squares deviation (= squared Euclidean distance, monotone to Euclidean distance) in step 1 also yields guaranteed convergence. The mean is exactly where your wrap-around idea would fall apart.
If you plug in a random other distance function as suggested by #elyase k-means might no longer converge.
Proper solutions
There are various solutions to this:
Use K-medoids (PAM). By choosing the medoid instead of the mean you do get guaranteed convergence with arbitrary distances. However, computing the medoid is rather expensive.
Transform the data into a kernel space where you are happy with minimizing the sum of squares. For example, you could transform the hour into sin(hour / 12 * pi), cos(hour / 12 * pi), which may be okay for SSQ (see the sketch after this list).
Use other, distance-based clustering algorithms. K-means is old, and there has been a lot of research on clustering since. You may want to start with hierarchical clustering (which actually is just as old as k-means), and then try DBSCAN and the variants of it.
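A minimal sketch of that second option (the first sample row comes from the question; the other rows, the cluster count, and the lack of rescaling of the remaining columns are just illustrative simplifications):
import numpy as np
from sklearn.cluster import KMeans
data = np.array([[22, 418, 192],
                 [23, 400, 180],
                 [ 1, 410, 200]], dtype=float)   # first column is the hour-of-day
angle = data[:, 0] / 12 * np.pi                  # map 0-23 h onto a full circle
data_circ = np.column_stack([np.sin(angle), np.cos(angle), data[:, 1:]])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(data_circ)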
The easiest approach, to me, is to adapt the k-means algorithm to the wrap-around dimension by computing a "circular mean" for that dimension. Of course, you will also need to change the distance-to-centroid calculation accordingly.
# Compute the circular mean of hour 0 and hour 23.
import numpy as np
hours = np.arange(24)
# Hours to angles.
angles = hours / 24 * (2 * np.pi)
sin = np.sin(angles)
cos = np.cos(angles)
a = np.arctan2(sin[23] + sin[0], cos[23] + cos[0])
if a < 0:
    a += 2 * np.pi
# Angle back to hour.
hour = a * 24 / (2 * np.pi)
# hour == 23.5
