Pairwise distance in very large datasets - python

I have an array that is about [5000000 x 6] and I need to select only the points (rows) that are at least a certain distance from each other.
The idea should be:
Start new_array with first row from data array
Compare new_array with the second row from data array
If the pdist between them is > tol, append the row to new_array
Compare new_array with the third row from data array
and so on...
One problem is RAM size. I can't compare all rows at once, even with pdist.
So I've been thinking about splitting the dataset into smaller ones, but then I don't know how to retrieve the index information for the rows in the original dataset.
I've tried scipy cdist, scipy euclidean, sklearn euclidean_distances, sklearn paired_distances, and the code below is the fastest I could get. At first it is fast, but after 40k loops it becomes really slow.
import numpy as np
from scipy.spatial.distance import pdist

xyTotal = np.random.random([5000000, 6])
tol = 0.5
ng = [xyTotal[0]]  # start with the first row
for z in xyTotal[1:]:
    # keep z only if it is farther than tol from every point kept so far
    if (pdist(np.vstack([np.array(ng), z])) > tol).all():
        ng.append(z)
Any suggestions for this problem?
EDIT
from sklearn.neighbors import BallTree

ktree = BallTree(xyTotal, leaf_size=40, metric='euclidean')
btsem = []
for j in xyTotal:
    # count_only=True returns how many points lie within r=tol of j
    if ktree.query_radius(j.reshape(1, -1), r=tol, count_only=True) == 1:
        btsem.append(j)
This is fast, but I'm only picking outliers. When I get to points that are near to others (i.e. in a little cluster) I don't know how to pick only one point and leave out the others, since I will get the same metrics for all points in the cluster (they all have the same distance to each other).

The computation is slow because the complexity of your algorithm is quadratic: O(k * n * n) where n is len(xyTotal) and k is the probability of the condition being true. Thus, assuming k=0.1 and n=5000000, the running time will be huge (likely hours of computation).
Fortunately, you can write a better implementation running in O(n * log(n)) time. However, this is tricky to implement. You need to add your ng points to a k-d tree; then, for each candidate point, you can search for its nearest neighbor among the kept points and check whether that distance is greater than tol.
Note that there are Python modules implementing k-d trees, and the SciPy documentation provides an example implementation written in pure Python (so likely not very efficient).
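As a rough sketch of this idea (not incremental insertion, but a single static scipy cKDTree plus a greedy pass that keeps a point only if no previously kept point lies within tol, which is equivalent to the loop in the question; for a dense dataset and a large tol the neighbour lists can still be large, so treat it as a sketch rather than a drop-in solution):

import numpy as np
from scipy.spatial import cKDTree

xyTotal = np.random.random([5000000, 6])
tol = 0.5

tree = cKDTree(xyTotal)
keep = np.ones(len(xyTotal), dtype=bool)
for i in range(len(xyTotal)):
    if not keep[i]:
        continue                      # already within tol of a kept point
    for j in tree.query_ball_point(xyTotal[i], r=tol):
        if j != i:
            keep[j] = False           # discard everything within tol of point i
ng = xyTotal[keep]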

Related

Matrix of "for x in vectors: for y in vectors:", without two for loops

Can one get the cartesian outer product of two sets of vectors without using two for loops? It is slow because the data is large.
[[f(x,y) for x in vectors] for y in vectors]
I am trying to do an agglomerative clustering project in python and for this, I need to create a distance matrix.
This is the code that I have to define the function for the distance matrix:
import numpy as np

def distance_matrix(vectors):
    s = np.zeros((len(vectors), len(vectors)))
    for i in range(len(vectors)):
        for v in range(len(vectors)):
            s[i, v] = dissimilarity(vectors[i], vectors[v])
    return s
What it should do is take a list of NumPy arrays and return a 2D NumPy array d where the entry d[i,j] should contain the dissimilarity between vectors[i] and vectors[j].
In this case, vectors is the list of NumPy arrays and the dissimilarity is calculated by:
def dissimilarity(v1, v2):
    return 1 - (v1.dot(v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
or in other words, the dissimilarity is the cosine dissimilarity between two 1D NumPy arrays.
My goal is to find a way to get the distance matrix without the double for loops but still have the computational time be very small.
In general one cannot do this. In general (but not in this case), the computation time will not change, because the work is physically there: no matter what accounting tricks you use to rewrite it (a single for loop that acts like two for loops, a recursive implementation, and so on), at the end of the day the work must still be done irrespective of the order in which you do it.
Is there a way to get rid of the for loops? Even if it slows down the run time a bit. – Cat_Smithyyyy03
Your case is special. While one normally has no reason to expect a speedup is possible, you are asking to consider tensor products, or in this case a matrix product. When you think about a general matrix multiplication AB = C, the product C basically contains the dot products a·b of every row a of A with every column b of B.
So, for your case, first normalize all your vectors (so the normalization is not required later), then stack them to form A, then let B=transp(A), then do a matrix multiply.
         A               A^T                    A A^T
    [[a0 a1 a2]     [[a0 b0 ... z0]      [[a·a  a·b  ...  a·z]
     [b0 b1 b2]      [a1 b1 ... z1]       [b·a  b·b  ...  b·z]
        ...          [a2 b2 ... z2]]  =     ...
     [z0 z1 z2]]                          [z·a  z·b  ...  z·z]]
Interestingly, you can now plug this into a fast matrix-multiply algorithm, which is actually faster than O(dim * #vecs^2). Ideally it would also be optimized for the self-transpose case that produces a symmetric result (which might save a factor of 2 of the work... maybe it has some flag like matrixmult(a, b, outputWillBeSymmetric)).
This is faster than "O(N^2)", unintuitively: the rewrite exposes a substructure in the problem, which can be leveraged to get below O(dim * #vecs^2). The exploitable substructure is the fact that you are computing the outer product of THE SAME set of vectors; the fast matrix-multiply algorithm leverages this.
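For reference, a minimal NumPy sketch of the normalise-stack-multiply step itself (this part is standard vectorisation and is independent of the fast-matrix-multiply discussion; it assumes vectors is a list of equal-length 1-D arrays):

import numpy as np

def distance_matrix(vectors):
    A = np.vstack(vectors).astype(float)           # shape (n_vectors, dim)
    A /= np.linalg.norm(A, axis=1, keepdims=True)  # normalise each row
    # the cosine similarity of every pair is now a single matrix product
    return 1.0 - A @ A.T                           # cosine dissimilarity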
edit: my original answer was wrong
You have a set of size N and you wish to compute f(a,b) for all a and b in the set.
Unless you know some values are trivial, there is no way to be asymptotically faster than this, because you have to imagine the worst case: Every pair f(a,b) may be unique... so there's no way to do less than roughly N^2 work.
However, since your function f is symmetric, you can do half the work and then mirror it:
N = len(vectors)
s = np.zeros((N, N))
for i in range(N):
    for v in range(i, N):
        dissim = dissimilarity(vectors[i], vectors[v])
        s[i, v] = dissim
        s[v, i] = dissim
(You can avoid calculating your metric in the reflexive case f(a,a) because it's trivial, but that doesn't make things asymptotically faster since the fraction of such work N/N^2 tends to zero as N increases, so it's not that great an optimization... it is reasonable only if you are working with a very small number of vectors.)
Whether you should optimize further depends on whether you need to. Such code should be able to easily handle millions of small vectors. Next steps:
Is there something fishy that is making my code very slow? What could it be? We don't have enough information to comment.
If there is nothing fishy of the sort, you can try rephrasing as matrix operations, in order to stay as much as possible in numpy's optimized C routines instead of bouncing back and forth into python. This is ugly and I would avoid doing it, because your code readability will decrease.
If you are dealing with hundreds of millions of vectors, perhaps consider a more cache-friendly approach where you process the matrix block by block, e.g. for blockI in range(N//10**6): for blockV in range(N//10**6): for i in range(blockI*10**6, (blockI+1)*10**6): for v in range(blockV*...): (see the sketch after this list).
If you're dealing with billions of vectors, then look into leveraging gpgpu. This is quite ideal for the gpu and might be a factor of thousands speedup.
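Roughly what the last two bullets are pointing at, as a hedged sketch (it assumes the vectors have already been stacked and L2-normalised into a single (n, dim) NumPy array A, and the block size is an arbitrary number to tune):

import numpy as np

def blocked_distance_matrix(A, block=1024):
    # A: (n, dim) array whose rows are already L2-normalised vectors
    n = len(A)
    s = np.empty((n, n))
    for i0 in range(0, n, block):
        for v0 in range(0, n, block):
            bi = A[i0:i0 + block]
            bv = A[v0:v0 + block]
            # cosine dissimilarity for every pair in this block,
            # computed as one matrix product instead of a Python loop
            s[i0:i0 + block, v0:v0 + block] = 1.0 - bi @ bv.T
    return s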

python3 (nltk/numpy/etc): ISO efficient way to compute find pairs of similar strings

I have a list of N strings. My task is to find all pairs of strings that are sufficiently similar. That is, I need (i) a similarity metric that would produce a number in a predefined range (say between 0 and 1) that measures how similar the two strings are and (ii) a way of going through O(N^2) pairs quickly to find those that are above some sort of threshold (say >= 0.9 if the metric gives larger numbers for more similar strings). What I am doing now is pretty slow (as one might expect) for a large N:
import difflib

num_strings = len(my_strings)
for i in range(num_strings):
    s_i = my_strings[i]
    for j in range(i+1, num_strings):
        s_j = my_strings[j]
        sim = difflib.SequenceMatcher(a=s_i, b=s_j).ratio()
        if sim >= thresh:
            print("%s\t%s\t%f" % (s_i, s_j, sim))
Questions:
What would be a good way of vectorizing this double loop to speed it up, maybe using NLTK, numpy or any other library?
Would you recommend a better metric than difflib's ratio (again, from NLTK, numpy etc)?
Thank you
If you want the optimal solution, you have to be O(n^2); if you can accept an approximation of the optimal solution, you can select a threshold and discard pairs that have a low similarity ratio.
I would suggest you use another metric, since difflib's ratio adds complexity (it depends on the length of the strings). Alternative metrics could be entropy or Manhattan/Euclidean distance.
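One hedged way to act on that suggestion (an illustration, not part of the original answer): embed each string as a character-count vector and compute all pairwise euclidean distances at once with scipy, using it only as a cheap pre-filter before the more expensive difflib ratio:

import numpy as np
from collections import Counter
from scipy.spatial.distance import pdist, squareform

def char_count_matrix(strings, alphabet="abcdefghijklmnopqrstuvwxyz"):
    # each row counts how often each character of `alphabet` occurs
    counts = np.zeros((len(strings), len(alphabet)))
    for i, s in enumerate(strings):
        c = Counter(s.lower())
        counts[i] = [c[ch] for ch in alphabet]
    return counts

my_strings = ["kitten", "sitting", "banana", "bandana"]
X = char_count_matrix(my_strings)
D = squareform(pdist(X, metric="euclidean"))   # (N, N) distance matrix
print(D)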

fast comparison of large amount of list of lists

Comparing lists of lists has been posted about before, but the Python environment that I am working in cannot fully integrate all the methods and classes in numpy. I cannot import pandas either.
I am trying to compare lists within a big list and come up with roughly 8-10 lists that approximate all the other lists in the big list.
The approach I have works fine if I have <50 lists in the big list. However, I am trying to compare at least 20k lists and ideally 1million+. I am currently looking into itertools. What might be the fastest, most efficient approach for large data sets without using numpy or pandas?
I am able to use some of the methods and classes in numpy but not all. For example, numpy.allclose and numpy.all do not work properly and that is because of the environment that I am working in.
global rel_tol, avg_lists
rel_tol = .1
avg_lists = []
cntr = 0
# compare the lists in the big list and output ~8-10 lists that approximate all the lists in the big list
for j in range(len(big_list)):
    for k in range(len(big_list)):
        array1 = np.array(big_list[j])
        array2 = np.array(big_list[k])
        if j != k:
            diff = np.subtract(array1, array2)
            abs_diff = np.absolute(diff)
            # cannot use numpy.allclose
            # if the deviation for the largest value in the array is < 10%
            if np.amax(abs_diff) <= rel_tol and big_list[k] not in avg_lists:
                cntr += 1
                avg_lists.append(big_list[k])
Fundamentally, it looks like what you're aiming at is a clustering operation (i.e. representing a set of N points via K < N cluster centers). I would suggest a K-Means clustering approach, where you increase K until the size of your clusters is below your desired threshold.
I'm not sure what you mean by "cannot fully integrate all the methods and classes in numpy", but if scikit-learn is available you could use its K-means estimator. If that's not possible, a simple version of the K-means algorithm is relatively easy to code from scratch, and you might use that.
Here's a k-means approach using scikit-learn:
# 100 lists of length 10 = 100 points in 10 dimensions
from random import random
big_list = [[random() for i in range(10)] for j in range(100)]
# compute eight representative points
from sklearn.cluster import KMeans
model = KMeans(n_clusters=8)
model.fit(big_list)
centers = model.cluster_centers_
print(centers.shape) # (8, 10)
# this is the sum of square distances of your points to the cluster centers
# you can adjust n_clusters until this is small enough for your purposes.
sum_sq_dists = model.inertia_
From here you can e.g. find the closest point in each cluster to its center and treat this as the average. Without more detail of the problem you're trying to solve, it's hard to say for sure. But a clustering approach like this will be the most efficient way to solve a problem like the one you stated in your question.
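For example, a small follow-up sketch (reusing the names from the snippet above) that picks, for each cluster, the member of big_list closest to its center:

import numpy as np

X = np.array(big_list)
labels = model.labels_
representatives = []
for k in range(model.n_clusters):
    members = np.where(labels == k)[0]                       # indices in this cluster
    dists = np.linalg.norm(X[members] - centers[k], axis=1)  # distances to the center
    representatives.append(big_list[members[np.argmin(dists)]])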

Efficiently update values held in scoring matrix

I am continuously calculating correlation matrices where each time the order of the underlying data is randomized. When a correlation score with randomized data is greater than or equal to the original correlation determined with ordered data, I would like to update the corresponding cell in a scoring matrix with +1. (All cells begin as zeroes in the scoring matrix).
Due to the size of the matrices I am dealing with, shape = (3681, 12709), I would like to find an efficient way of doing this. So far, what I have is inefficient and takes too long. I wonder if there is a matrix-operation style approach to this rather than iterating, as I am currently doing below:
from itertools import product

for i, j in product(data_sorted.index, data_sorted.columns):
    # if random correlation is as good as or better than sorted correlation
    if data_random.loc[i, j] >= data_sorted.loc[i, j]:
        # update scoring matrix
        scoring_matrix[sorted_index_list.index(i)][sorted_column_list.index(j)] += 1
I have crudely timed this approach and found that doing this for a single line of my matrix takes roughly 4.2 seconds, which seems excessive.
Any help would be much appreciated.
Assuming everything has the same indices, this should work as expected and be pretty quick.
scoring_matrix += (data_random >= data_sorted).astype(int)
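If the DataFrames are not already in the same row/column order as scoring_matrix, a hedged variant that aligns them first (assuming sorted_index_list and sorted_column_list define that order, as in the question):

random_aligned = data_random.loc[sorted_index_list, sorted_column_list].to_numpy()
sorted_aligned = data_sorted.loc[sorted_index_list, sorted_column_list].to_numpy()
scoring_matrix += (random_aligned >= sorted_aligned).astype(int)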

Efficient implementation of the transition matrix for page rank

I'm trying to implement PageRank. I'm reading the description here: http://nlp.stanford.edu/IR-book/html/htmledition/markov-chains-1.html
Everything is very clear to me, however I'm concerned about the construction of the matrix $P$. I find that constructing $P$ the naive way would be very expensive. For example: to implement step 1, one would need to check every row of $A$ and then check every element of that row to see if all elements are zero. For step 2 one would need to compute the number of ones for each row. I can imagine my code to have nasty slow loops. I was wondering if there are smart linear algebra techniques that could efficiently construct $P$. I will be using python numpy for my coding.
EDIT: one way I'm now thinking of solving this is by doing an element-wise summation over the columns of $A$ (i.e. summing each row). That would give me a column vector. Now I go through each element of this vector to check which elements are zero. That way I know which rows have no 1s, and I can fill those rows with $1/N$.
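A short sketch of the idea in that EDIT (dense NumPy, only viable for a small $A$; the toy matrix is illustrative):

import numpy as np

N = 3
A = np.array([[0., 1., 1.],
              [0., 0., 0.],    # a "sink" row with no outgoing links
              [1., 0., 0.]])
row_sums = A.sum(axis=1)        # element-wise sum over the columns
A[row_sums == 0] = 1.0 / N      # rows with no 1s are filled with 1/N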
Your concern is correct. Since the number of web pages (vertices in the representing graph) is huge, it is impossible to actually generate such an A and work on it directly.
The PageRank matrix computation can be done much more efficiently using sparse matrix implementations, since the matrix is very sparse: most webpages are not connected to each other, so most entries in the matrix are 0.
The sparse matrix is built as follows:
Build matrix A as described: A_ij = 1 if (i,j) is an edge, otherwise A_ij = 0.
Step 1 is usually not done; instead we remove 'sinks' iteratively. This is done to prevent the matrix from becoming dense. Some alternatives are linking 'sinks' back to the nodes that linked to them, or linking a sink to itself.
Divide each 1 in A as described in step (2).
Let's denote the resulting matrix as M; this is the matrix we will work with in order to get a column vector p (initialized with 1/n in each entry).
x = [1/n, 1/n, ..., 1/n]^T  // a column vector
p = [1/n, 1/n, ..., 1/n]^T  // a column vector with the initial ranks
M = genSparseMatrix()       // as described above
do until p converges:
    p = (1 - \alpha) * M * p + \alpha * x
return p
In the end, this yields p, the column vector that holds the page rank value for each node.
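As a minimal sketch of that iteration with scipy.sparse (the edge-list input, alpha, and the convergence tolerance are illustrative assumptions; the transpose makes M column-stochastic so that M @ p acts on a column vector of ranks, matching the pseudocode):

import numpy as np
import scipy.sparse as sp

def pagerank(edges, n, alpha=0.15, tol=1e-10):
    # edges: iterable of (i, j) pairs meaning "page i links to page j"
    rows, cols = zip(*edges)
    A = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    out_deg = np.asarray(A.sum(axis=1)).ravel()
    out_deg[out_deg == 0] = 1                     # crude handling of sinks
    M = (sp.diags(1.0 / out_deg) @ A).T.tocsr()   # M[j, i] = A[i, j] / deg(i)
    x = np.full(n, 1.0 / n)                       # teleport vector
    p = np.full(n, 1.0 / n)                       # initial ranks
    while True:
        p_next = (1 - alpha) * (M @ p) + alpha * x
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next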
