Pairwise Earth Mover Distance across all documents (word2vec representations) - python

Is there a library that will take a list of documents and compute the n x n matrix of distances en masse, given a word2vec model? I can see that gensim lets you do this between two documents, but I need a fast comparison across all docs, like sklearn's cosine_similarity.

The "Word Mover's Distance" (earth-mover's distance applied to groups of word-vectors) is a fairly involved optimization calculation dependent on every word in each document.
I'm not aware of any tricks that would help it go faster when calculating many at once – even many distances to the same document.
So all that's needed to calculate the pairwise distances is a pair of nested loops that considers each (order-ignoring, unique) pairing.
For example, assuming your list of documents (each a list-of-words) is docs, a gensim word-vector model in model, and numpy imported as np, you could calculate the array of pairwise distances D with:
D = np.zeros((len(docs), len(docs)))
for i in range(len(docs)):
    for j in range(len(docs)):
        if i == j:
            continue  # self-distance is 0.0
        if i > j:
            D[i, j] = D[j, i]  # re-use earlier calc
            continue
        D[i, j] = model.wmdistance(docs[i], docs[j])
It may take a while, but you'll then have all pairwise distances in array D.
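Once D is filled, ordinary NumPy indexing gives you, for example, the closest pair of documents (a small sketch using only the array built above):
import numpy as np

# mask the zero diagonal so self-pairs are ignored
masked = D + np.diag(np.full(len(docs), np.inf))
i, j = np.unravel_index(np.argmin(masked), masked.shape)
print("closest pair: docs %d and %d, WMD = %.4f" % (i, j, masked[i, j]))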

On top of the accepted answer, you may want to use the faster WMD library wmd-relax.
The loop body in the example above could then be adjusted to:
D[i, j] = docs[i].similarity(docs[j])
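If memory serves, wmd-relax plugs into spaCy's similarity method via a pipeline hook roughly like the snippet below (based on my recollection of the wmd-relax README, so treat the class name and the spaCy-2.x-style add_pipe call as assumptions to verify; docs[i] and docs[j] are then spaCy Doc objects):
import spacy
import wmd  # the wmd-relax package

nlp = spacy.load('en_core_web_md')
# hook name per the wmd-relax README; spaCy 3.x registers pipes by string name instead
nlp.add_pipe(wmd.WMD.SpacySimilarityHook(nlp), last=True)

doc1 = nlp("Politician speaks to the media in Illinois.")
doc2 = nlp("The president greets the press in Chicago.")
print(doc1.similarity(doc2))  # Word Mover's Distance via wmd-relax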

Related

Pairwise distance in very large datasets

I have an array that is about [5000000 x 6] and I need to select only the points (rows) that are at least a certain distance from each other.
The idea is:
Start new_array with the first row from the data array
Compare new_array with the second row from the data array
If the pdist between them is > tol, append the row to new_array
Compare new_array with the third row from the data array
and so on...
One problem is RAM size. I can't compare all rows at once, even with pdist.
So I've been thinking of splitting the dataset into smaller ones, but then I don't know how to retrieve the index information for the rows in the original dataset.
I've tried scipy cdist, scipy euclidean, sklearn euclidean_distances, sklearn paired_distances, and the code below is the fastest I could get. At first it is fast, but after 40k iterations it becomes really slow.
import numpy as np
from scipy.spatial.distance import pdist

xyTotal = np.random.random([5000000, 6])
tol = 0.5
ng = [xyTotal[0]]  # start with the first row
for i, z in enumerate(xyTotal[1:]):
    if (pdist(np.vstack([np.array(ng), z])) > tol).all():
        ng.append(z)
Any suggestions for this problem?
EDIT
from sklearn.neighbors import BallTree

ktree = BallTree(xyTotal, leaf_size=40, metric='euclidean')
btsem = []
for i, j in enumerate(xyTotal):
    # a count of 1 means the only point within tol of j is j itself
    if ktree.query_radius(j.reshape(1, -1), r=tol, count_only=True) == 1:
        btsem.append(j)
This is fast, but I'm only picking outliers. When I get to points that are near one another (i.e. in a little cluster), I don't know how to pick just one point and leave out the others, since I get the same result for every point in the cluster (they are all within tol of each other).
The computation is slow because the complexity of your algorithm is quadratic: O(k * n * n), where n is len(xyTotal) and k is the probability of the condition being true. Thus, assuming k = 0.1 and n = 5000000, the running time will be huge (likely hours of computation).
Fortunately, you can write a better implementation running in O(n * log(n)) time. However, this is tricky to implement. You need to add your ng points to a k-d tree; then, for each candidate point, you can search for its nearest neighbour and check whether the distance to the current point is greater than tol.
Note that you can find Python modules implementing k-d trees, and the SciPy documentation provides an example implementation written in pure Python (so likely not very efficient).
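Here is a rough sketch of that idea using scipy.spatial.cKDTree (not the answer's exact method: it builds one tree over all points up front and greedily keeps one representative per tol-neighbourhood, which also sidesteps the "little cluster" issue from the edit):
import numpy as np
from scipy.spatial import cKDTree

xyTotal = np.random.random([100000, 6])  # smaller n just for the sketch
tol = 0.5

tree = cKDTree(xyTotal)
excluded = np.zeros(len(xyTotal), dtype=bool)
kept = []
for i, point in enumerate(xyTotal):
    if excluded[i]:
        continue
    kept.append(i)  # keep this row as the representative of its neighbourhood
    # mark every point within tol of it (including itself) as excluded
    excluded[tree.query_ball_point(point, r=tol)] = True

ng = xyTotal[kept]  # kept rows are pairwise more than tol apart
Each query costs roughly O(log n) plus the number of neighbours returned, so the whole pass avoids the quadratic blow-up of the pairwise approach.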

Implementation of TextRank algorithm using Spark(Calculating cosine similarity matrix using spark)

I am trying to implement the TextRank algorithm, where I calculate a cosine-similarity matrix for all the sentences. I want to parallelize the task of similarity matrix creation using Spark but don't know how to implement it. Here is the code:
import numpy as np
import networkx as nx
from tqdm import tqdm
from sklearn.metrics.pairwise import cosine_similarity

cluster_summary_dict = {}
for cluster, sentences in tqdm(cluster_wise_sen.items()):
    sen_sim_matrix = np.zeros([len(sentences), len(sentences)])
    for row in range(len(sentences)):
        for col in range(len(sentences)):
            if row != col:
                sen_sim_matrix[row][col] = cosine_similarity(
                    cluster_dict[cluster][row].reshape(1, 100),
                    cluster_dict[cluster][col].reshape(1, 100))[0, 0]
    sentence_graph = nx.from_numpy_array(sen_sim_matrix)
    scores = nx.pagerank(sentence_graph)
    pagerank_sentences = sorted(((scores[k], sent) for k, sent in enumerate(sentences)),
                                reverse=True)
    cluster_summary_dict[cluster] = pagerank_sentences
Here, cluster_wise_sen is a dictionary that maps each cluster to a list of sentences ({'cluster 1': [list of sentences], ..., 'cluster n': [list of sentences]}), and cluster_dict contains the 100-d vector representation of each sentence. I have to compute the sentence similarity matrix for each cluster. Since this is time-consuming, I am looking to parallelize it using Spark.
The experiments with large-scale matrix calculation for cosine similarity are written up well here!
To achieve speed without compromising much on accuracy, you can also try hashing methods like MinHash and evaluate Jaccard distance similarity. It comes with a nice implementation in Spark MLlib, and the documentation has very detailed examples for reference: http://spark.apache.org/docs/latest/ml-features.html#minhash-for-jaccard-distance
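A minimal sketch of that MinHash route, loosely following the Spark MLlib docs linked above (the toy vectors, column names, and the 0.6 threshold are arbitrary choices here; in practice each sentence would first be encoded as a sparse term-occurrence vector):
from pyspark.sql import SparkSession
from pyspark.ml.feature import MinHashLSH
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# toy data: each "sentence" as a sparse binary vector over a vocabulary of size 6
data = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0])),
        (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0])),
        (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]))]
df = spark.createDataFrame(data, ["id", "features"])

mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
model = mh.fit(df)

# self-join: all pairs of sentences with Jaccard distance below 0.6
pairs = model.approxSimilarityJoin(df, df, 0.6, distCol="JaccardDistance")
pairs.select("datasetA.id", "datasetB.id", "JaccardDistance").show()
approxSimilarityJoin returns every pair whose Jaccard distance falls below the threshold, which you could then feed into the per-cluster graph/PageRank step.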

fast comparison of large amount of list of lists

Comparing lists of lists has been posted about before, but the Python environment that I am working in cannot fully integrate all the methods and classes in numpy. I cannot import pandas either.
I am trying to compare lists within a big list and come up with roughly 8-10 lists that approximate all the other lists in the big list.
The approach I have works fine if I have <50 lists in the big list. However, I am trying to compare at least 20k lists and ideally 1million+. I am currently looking into itertools. What might be the fastest, most efficient approach for large data sets without using numpy or pandas?
I am able to use some of the methods and classes in numpy but not all. For example, numpy.allclose and numpy.all do not work properly and that is because of the environment that I am working in.
import numpy as np

rel_tol = .1
avg_lists = []
cntr = 0
# compare the lists in the big list and output ~8-10 lists that approximate all the lists in the big list
for j in range(len(big_list)):
    for k in range(len(big_list)):
        array1 = np.array(big_list[j])
        array2 = np.array(big_list[k])
        if j != k:
            diff = np.subtract(array1, array2)
            abs_diff = np.absolute(diff)
            # cannot use numpy.allclose
            # if the deviation for the largest value in the array is < 10%
            if np.amax(abs_diff) <= rel_tol and big_list[k] not in avg_lists:
                cntr += 1
                avg_lists.append(big_list[k])
Fundamentally, it looks like what you're aiming at is a clustering operation (i.e. representing a set of N points via K < N cluster centers). I would suggest a K-Means clustering approach, where you increase K until the size of your clusters is below your desired threshold.
I'm not sure what you mean by "cannot fully integrate all the methods and classes in numpy", but if scikit-learn is available you could use its K-means estimator. If that's not possible, a simple version of the K-means algorithm is relatively easy to code from scratch, and you might use that.
Here's a k-means approach using scikit-learn:
# 100 lists of length 10 = 100 points in 10 dimensions
from random import random
big_list = [[random() for i in range(10)] for j in range(100)]
# compute eight representative points
from sklearn.cluster import KMeans
model = KMeans(n_clusters=8)
model.fit(big_list)
centers = model.cluster_centers_
print(centers.shape) # (8, 10)
# this is the sum of square distances of your points to the cluster centers
# you can adjust n_clusters until this is small enough for your purposes.
sum_sq_dists = model.inertia_
From here you can e.g. find the closest point in each cluster to its center and treat this as the average. Without more detail of the problem you're trying to solve, it's hard to say for sure. But a clustering approach like this will be the most efficient way to solve a problem like the one you stated in your question.
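For instance, a small sketch continuing the scikit-learn example above (assuming NumPy is usable enough in your environment for these calls): take the member of each cluster that sits closest to its center as that cluster's representative.
import numpy as np

X = np.array(big_list)
labels = model.predict(X)    # cluster index assigned to each list
dists = model.transform(X)   # distance from each point to every cluster center

representatives = []
for c in range(8):           # same n_clusters as above
    members = np.where(labels == c)[0]
    # member of cluster c that is nearest its own center
    best = members[np.argmin(dists[members, c])]
    representatives.append(big_list[best])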

Python: Single linkage clustering algorithm

I am new to Python and I am looking for an example of a naive, simple single linkage clustering python algorithm that is based on creating a proximity matrix and removing nodes from that. I know that there are packages such as numpy but I would rather avoid them.
I have searched online but couldn't find any code simple enough to be able to understand in order to replicate it myself afterwards.
1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
2. Find the most similar pair of clusters in the current clustering, say pair (r), (s), according to d[(r),(s)] = min d[(i),(j)], where the minimum is over all pairs of clusters in the current clustering.
3. Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form the next clustering m. Set the level of this clustering to L(m) = d[(r),(s)].
4. Update the proximity matrix, D, by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster. The proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined as d[(k),(r,s)] = min d[(k),(r)], d[(k),(s)].
5. If all objects are in one cluster, stop. Else, go to step 2.
These are the steps as described on Wikipedia. I have created the distance matrix but am not sure how to proceed from there.
This is what I have so far:
comparing
def comparison(protein1, protein2):
    l = [i for i in range(len(protein1)) if protein1[i] != protein2[i]]
    return len(l)
creating the matrix
def matrix(proteins):
    r = []
    for p1 in proteins:
        row = []
        for p2 in proteins:
            row.append(comparison(p1, p2))
        r.append(row)
    return r
These are the sequences I am trying to compare:
seqlist = {
    "Human": "MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHG",
    "Chimpanzee": "MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHG",
    "Western tarsier": "MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGXNLHG",
    "Mouse": "MGDAEAGKKIFVQKCAQCHTVEKGGKHKTGPNLWG",
    "Rabbit": "MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHG",
    "Dog": "MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHG",
    "Pig": "MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHG",
    "Snapping turtle": "MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLNG",
    "Alligator": "MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHG",
    "Honeybee": "AGDPEKGKKIFVQKCAQCHTIESGGKHKVGPNLYG",
}
You should look at the package scipy, which has several hierarchical clustering algorithms implemented (see scipy.cluster.hierarchy). Also look at the function pdist in the scipy.spatial.distance module.
You should be able to get lots of nice usage examples from there.
See http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html
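For example, a minimal sketch using the seqlist above (this assumes all sequences have equal length and encodes each character as its ordinal so pdist can apply a Hamming metric; note the scipy route does use NumPy arrays under the hood):
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

names = list(seqlist)
X = np.array([[ord(c) for c in seqlist[name]] for name in names])

# condensed pairwise distance matrix (fraction of differing positions)
d = pdist(X, metric='hamming')

# single-linkage merge tree; each row is (cluster_i, cluster_j, distance, size)
Z = linkage(d, method='single')
print(Z)

# optional: plot the tree with matplotlib
# import matplotlib.pyplot as plt
# dendrogram(Z, labels=names)
# plt.show()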

Efficiently determine "how sorted" a list is, e.g. Levenshtein distance

I'm doing some research on ranking algorithms, and would like to, given a sorted list and some permutation of that list, calculate some distance between the two permutations. For the case of the Levenshtein distance, this corresponds to calculating the distance between a sequence and a sorted copy of that sequence. There is also, for instance, the "inversion distance", a linear-time algorithm of which is detailed here, which I am working on implementing.
Does anyone know of an existing python implementation of the inversion distance, and/or an optimization of the Levenshtein distance? I'm calculating this on a sequence of around 50,000 to 200,000 elements, so O(n^2) is far too slow, but O(n log(n)) or better should be sufficient.
Other metrics for permutation similarity would also be appreciated.
Edit for people from the future:
Based on Raymond Hettinger's response: it's not Levenshtein or inversion distance, but rather "gestalt pattern matching" :P
from difflib import SequenceMatcher
import random
ratings = [random.gauss(1200, 200) for i in range(100000)]
SequenceMatcher(None, ratings, sorted(ratings)).ratio()
runs in ~6 seconds on a terrible desktop.
Edit2: If you can coerce your sequence into a permutation of [1 .. n], then a variation of the Manhattan metric is extremely fast and has some interesting results.
manhattan = lambda l: sum(abs(a - i) for i, a in enumerate(l)) / (0.5 * len(l) ** 2)
rankings = list(range(100000))
random.shuffle(rankings)
manhattan(rankings) # ~ 0.6665, < 1 second
The normalization factor is technically an approximation; it is correct for even sized lists, but should be (0.5 * (len(l) ** 2 - 1)) for odd sized lists.
Edit3: There are several other algorithms for checking list similarity! The Kendall tau ranking coefficient and the Spearman ranking coefficient. Implementations of these are available in the SciPy library as scipy.stats.kendalltau and scipy.stats.spearmanr, and will return the coefficient along with the associated p-value.
Levenshtein distance is an O(n**2) algorithm, so if you want to go faster, use the alternative fast algorithm in the difflib module. The ratio method computes a measure of similarity between two sequences.
If you have to stick with Levenshtein, there is a Python recipe for it on the ASPN Python Cookbook: http://code.activestate.com/recipes/576874-levenshtein-distance/ .
Another Python script can be found at: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python
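For the inversion-distance part of the question, a minimal sketch (the classic merge-sort trick, not the linear-time algorithm referenced above) counts inversions in O(n log n):
def count_inversions(seq):
    """Count pairs (i, j) with i < j and seq[i] > seq[j] in O(n log n)."""
    def sort_count(a):
        if len(a) <= 1:
            return a, 0
        mid = len(a) // 2
        left, inv_left = sort_count(a[:mid])
        right, inv_right = sort_count(a[mid:])
        merged, inv = [], inv_left + inv_right
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
                inv += len(left) - i  # right[j] jumps ahead of the remaining left items
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged, inv
    return sort_count(list(seq))[1]

def inversion_ratio(seq):
    # 0.0 for an already-sorted sequence, 1.0 for a fully reversed one
    n = len(seq)
    return count_inversions(seq) / (n * (n - 1) / 2)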
