Comparing lists of lists has been posted about before, but the Python environment I am working in cannot fully integrate all the methods and classes in numpy. I cannot import pandas either.
I am trying to compare lists within a big list and come up with roughly 8-10 lists that approximate all the other lists in the big list.
The approach I have works fine if I have <50 lists in the big list. However, I am trying to compare at least 20k lists and ideally 1million+. I am currently looking into itertools. What might be the fastest, most efficient approach for large data sets without using numpy or pandas?
I am able to use some of the methods and classes in numpy but not all. For example, numpy.allclose and numpy.all do not work properly because of the environment I am working in. Here is my current code:
import numpy as np

rel_tol = .1
avg_lists = []
cntr = 0

# compare the lists in the big list and output ~8-10 lists that approximate all the lists in the big list
for j in range(len(big_list)):
    for k in range(len(big_list)):
        array1 = np.array(big_list[j])
        array2 = np.array(big_list[k])
        if j != k:
            diff = np.subtract(array1, array2)
            abs_diff = np.absolute(diff)
            # cannot use numpy.allclose
            # if the deviation for the largest value in the array is <= rel_tol
            if np.amax(abs_diff) <= rel_tol and big_list[k] not in avg_lists:
                cntr += 1
                avg_lists.append(big_list[k])
Fundamentally, it looks like what you're aiming at is a clustering operation (i.e. representing a set of N points via K < N cluster centers). I would suggest a K-Means clustering approach, where you increase K until the size of your clusters is below your desired threshold.
I'm not sure what you mean by "cannot fully integrate all the methods and classes in numpy", but if scikit-learn is available you could use its K-means estimator. If that's not possible, a simple version of the K-means algorithm is relatively easy to code from scratch, and you might use that.
Here's a k-means approach using scikit-learn:
# 100 lists of length 10 = 100 points in 10 dimensions
from random import random
big_list = [[random() for i in range(10)] for j in range(100)]
# compute eight representative points
from sklearn.cluster import KMeans
model = KMeans(n_clusters=8)
model.fit(big_list)
centers = model.cluster_centers_
print(centers.shape) # (8, 10)
# this is the sum of square distances of your points to the cluster centers
# you can adjust n_clusters until this is small enough for your purposes.
sum_sq_dists = model.inertia_
From here you can e.g. find the closest point in each cluster to its center and treat this as the average. Without more detail of the problem you're trying to solve, it's hard to say for sure. But a clustering approach like this will be the most efficient way to solve a problem like the one you stated in your question.
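If scikit-learn genuinely isn't importable in your environment, here is a rough idea of what a from-scratch version of the same algorithm (Lloyd's iterations, plain Python only) could look like. This is a minimal sketch under that assumption, with made-up helper names, not a tuned implementation:

import random

def kmeans(points, k, n_iter=20):
    # points: list of equal-length lists of numbers; returns k cluster centers
    centers = random.sample(points, k)
    for _ in range(n_iter):
        # assignment step: group each point with its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # update step: move each center to the mean of its cluster
        for idx, cluster in enumerate(clusters):
            if cluster:
                centers[idx] = [sum(dim) / len(cluster) for dim in zip(*cluster)]
    return centers

centers = kmeans(big_list, k=8)

Each of the 8 returned centers is an "average list" in the sense of your question; if a cluster is too spread out, increase k.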
Related
I have an array that is about [5000000 x 6] and I need to select only the points (rows) that are at least a certain distance from each other.
The idea should be:
Start new_array with first row from data array
Compare new_array with the second row from data array
If the pdist between them is > tol, append the row to new_array
Compare new_array with the third row from data array
and so on...
One problem is RAM size. I can't compare all rows at once, even with pdist.
So I've been thinking of splitting the dataset into smaller ones, but then I don't know how to retrieve the index information for the rows in the original dataset.
I've tried scipy cdist, scipy euclidean, sklearn euclidean_distances, sklearn paired_distances, and the code below is the fastest I could get. At first it is fast, but after 40k loops it becomes really slow.
import numpy as np
from scipy.spatial.distance import pdist

xyTotal = np.random.random([5000000, 6])
tol = 0.5
ng = [xyTotal[0]]                       # start new_array with the first row
for i, z in enumerate(xyTotal[1:], 1):
    # keep z only if it is farther than tol from every row kept so far
    if (pdist(np.vstack([np.array(ng), z])) > tol).all():
        ng.append(z)
Any suggestions for this problem?
EDIT
from sklearn.neighbors import BallTree

ktree = BallTree(xyTotal, leaf_size=40, metric='euclidean')
btsem = []
for i, j in enumerate(xyTotal):
    # count_only=True returns how many points lie within r of j (j itself included),
    # so a count of 1 means no other point is within tol
    if ktree.query_radius(j.reshape(1, -1), r=tol, count_only=True) == 1:
        btsem.append(j)
This is fast, but I'm only picking outliers. When I get to points that are near one another (i.e. in a little cluster) I don't know how to pick only one point and leave out the others, since I will get the same metric for all points in the cluster (they all have the same distances to each other).
The computation is slow because the complexity of your algorithm is quadratic: O(k * n * n) where n is len(xyTotal) and k is the probability of the condition being true. Thus, assuming k=0.1 and n=5000000, the running time will be huge (likely hours of computation).
Fortunately, you can write a better implementation running in O(n * log(n)) time, although it is tricky to implement. You need to add your ng points to a k-d tree; then, for each candidate point, you can search for its nearest neighbour and check whether the distance to the current point is greater than tol.
Note that you can find Python modules implementing k-d trees, and the SciPy documentation provides an example implementation written in pure Python (so likely not very efficient).
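As a rough illustration of the tree-based idea: SciPy's cKDTree cannot be updated incrementally, but you can build it once on the full array and do a greedy thinning pass with query_ball_point, which avoids the quadratic pairwise loop. A minimal sketch (the 100000-row array is just a stand-in for the real data, and the variable names follow the question):

import numpy as np
from scipy.spatial import cKDTree

xyTotal = np.random.random([100000, 6])   # stand-in for the real 5000000 x 6 array
tol = 0.5

tree = cKDTree(xyTotal)                   # built once on the full data set
kept = []                                 # indices of the selected rows
covered = np.zeros(len(xyTotal), dtype=bool)
for idx in range(len(xyTotal)):
    if covered[idx]:                      # already within tol of a kept row
        continue
    kept.append(idx)
    # mark everything within tol of the newly kept row so it is skipped later
    covered[tree.query_ball_point(xyTotal[idx], r=tol)] = True

ng = xyTotal[kept]                        # `kept` also gives you the original row indices

The per-query cost depends on how many neighbours fall inside tol, so this is not strictly O(n * log(n)), but in practice it is far faster than recomputing pdist against a growing list.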
I have browsed SO extensively and I have found many questions about generating all possible permutations, but none regarding generating a specific number of permutations.
I developed, thanks to many SO questions, a decent permutation test routine. However, I have to repeat it many times, and it is taking too long.
My code:
import numpy as np

def exact_mc_perm_test(ys, nmc, boolean_selection):
    # ys: all time series values
    # nmc: number of Monte Carlo permutations
    # boolean_selection: mask selecting the subsample
    # observed difference between the subsample mean and the overall mean
    mean_ys = np.mean(ys)
    diff = np.abs(np.mean(ys[boolean_selection]) - mean_ys)
    k = 0
    for j in np.arange(nmc):
        # in-place shuffling
        np.random.shuffle(ys)
        # difference between the overall mean and the shuffled subsample mean
        diff_shuffled = np.abs(np.mean(ys[boolean_selection]) - mean_ys)
        k += diff < diff_shuffled
    return k / nmc
I took this SO answer and modified it for my specific test.
I have to run it over a 3D array stored in an xarray dataset. The dataset has (lon, lat, time) coordinates, and I need to run the test for each (lon, lat) position (along the time dimension).
I run it using itertools.chain:
from itertools import chain

for ii in chain.from_iterable(zip(*dataset.variable())):
    iis = ii[selected_position].values   # subsample values (not needed by the routine as shown above)
    ind_x = dataset.lon == ii.lon
    ind_y = dataset.lat == ii.lat
    dataset.perm_test[ind_y, ind_x] = exact_mc_perm_test(ii.values, 1000, selected_position)
Ideally I want to run a permutation test with 20000 permutations. The two loops (over the (lon, lat) positions and over the 20000 shuffles) add up.
I am looking to speed up the permutation test code.
Therefore I thought about generating a 2D array of shape (len(ys), 20000), holding essentially 20000 shuffled copies of the ys array, then accessing them all at once and calculating the 20000 differences (diff in the code). (Or finding a trade-off between memory usage and looping, so maybe 5 loops of 4000 shuffles at a time.)
I could not figure out or find a way to do this.
The permutations command from itertools generates all the possible permutations which in my case are too many to handle.
I have looked at the random library but couldn't find something that fits my need. Any suggestion?
Take a look at compress() and permutations() from itertools:
from itertools import compress, permutations

for perm in compress(permutations(iterable, r=length), boolean_selection):
    print(perm)
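Alternatively, the batched idea described in the question itself (building many shuffled copies at once and comparing them in one vectorized step) can be sketched with NumPy alone. perm_test_vectorized and its defaults are illustrative names/values chosen here, and ys is assumed to be a 1-D NumPy array:

import numpy as np

def perm_test_vectorized(ys, boolean_selection, nmc=20000, batch=4000):
    mean_ys = np.mean(ys)
    diff = np.abs(np.mean(ys[boolean_selection]) - mean_ys)
    k = 0
    for _ in range(nmc // batch):
        # argsort of uniform random keys yields `batch` independent permutations
        perms = np.argsort(np.random.rand(batch, len(ys)), axis=1)
        shuffled = ys[perms]                                  # shape (batch, len(ys))
        diff_shuffled = np.abs(shuffled[:, boolean_selection].mean(axis=1) - mean_ys)
        k += np.count_nonzero(diff < diff_shuffled)
    return k / nmc

The batch size trades memory for looping, exactly as described in the question.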
I have a list of N strings. My task is to find all pairs of strings that are sufficiently similar. That is, I need (i) a similarity metric that would produce a number in a predefined range (say between 0 and 1) that measures how similar the two strings are and (ii) a way of going through O(N^2) pairs quickly to find those that are above some sort of threshold (say >= 0.9 if the metric gives larger numbers for more similar strings). What I am doing now is pretty slow (as one might expect) for a large N:
import difflib

thresh = 0.9                      # similarity threshold ("say >= 0.9" above)
num_strings = len(my_strings)     # my_strings: the list of N strings
for i in range(num_strings):
    s_i = my_strings[i]
    for j in range(i + 1, num_strings):
        s_j = my_strings[j]
        sim = difflib.SequenceMatcher(a=s_i, b=s_j).ratio()
        if sim >= thresh:
            print("%s\t%s\t%f" % (s_i, s_j, sim))
Questions:
What would be a good way of vectorizing this double loop to speed it up, maybe using NLTK, numpy or any other library?
Would you recommend a better metric than difflib's ratio (again, from NLTK, numpy etc)?
Thank you
If you want the exact solution you have to examine all O(N^2) pairs; if an approximation of the optimal solution is enough, you can select a threshold and discard pairs with only a fair similarity ratio.
I would also suggest using another metric, since difflib's ratio adds complexity (its cost depends on the length of the strings). Such a metric could be entropy-based or a Manhattan/Euclidean distance.
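One concrete way to read that suggestion, under the assumption (chosen here, not stated in the answer) that the distance is taken between simple character-count vectors: map each string to a fixed-length vector of letter counts and compare vectors with a cheap Euclidean distance before (or instead of) the more expensive ratio:

import math
import string
from collections import Counter

def char_vector(s, alphabet=string.ascii_lowercase):
    # crude featurization: counts of each lowercase letter
    counts = Counter(s.lower())
    return [counts[c] for c in alphabet]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

print(euclidean(char_vector("kitten"), char_vector("sitting")))

Smaller distances mean more similar count profiles; the threshold is up to you, and this deliberately ignores character order.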
I've seen how a particular naive shuffling algorithm is biased, and I feel like I basically get that, and I get how the Fisher-Yates algorithm is not biased. I have the following algorithm, which was the one I first thought of when I considered how to shuffle a list. I know it consumes twice the memory and takes unnecessarily long, but I'm still curious whether it produces each permutation with a uniform distribution, or if there's some sneaky reason I'm not seeing for it to be biased.
I'm also kind of wondering if there is some other "undesirable" property to a random shuffle that this would have, like perhaps the probabilities of various positions in the list being filled with some values are dependent.
import random as rand

def shuf(x):
    out = [None for i in range(len(x))]
    for i in x:
        pos = rand.randint(0, len(x) - 1)
        # keep drawing until we hit an empty slot
        while out[pos] is not None:
            pos = rand.randint(0, len(x) - 1)
        out[pos] = i
    return out
I generated a heat map of this on a list of 20 elements, running 10^6 trials, and it produced the following. The (i,j) coordinate of the map represents the probability of the ith position of the list being filled with the jth element of the original list.
While I don't see any pattern to the heat map, it looks like the variance might be high. Or that might be the heat map over-stating the variance because, hey, the minimum and max have to come up somewhere.
Undesirable property - this can be expensive if you're shuffling a large set:
while out[pos] != None:
pos = rand.randint(0,len(x)-1)
Imagine len(x) == 100,000,000 and you've placed 90,000,000 already: each draw now hits an empty slot with probability 0.1, so you expect about 10 tries per placement, and over the whole shuffle the expected number of draws grows like n * ln(n) (the coupon-collector bound). You're going to loop a LOT before the last few slots get a hit.
Interesting exercises:
What does the heat map look like for simply generating random numbers between 1 and len(x) over 10e6 iterations?
What does the heat map look like for Fisher-Yates, for comparison?
At a glance, it looks to me like, given a uniform RNG, it should yield a truly random distribution (albeit more slowly than Fisher-Yates).
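For the comparison suggested above, one rough way to generate the same kind of heat map for Fisher-Yates is to lean on random.shuffle (which implements it). heat_map and fisher_yates are names made up for this sketch, and the trial count is only illustrative:

import random

def heat_map(shuffler, n=20, trials=10**5):
    # counts[i][j] = how often position i ends up holding original element j
    counts = [[0] * n for _ in range(n)]
    for _ in range(trials):
        perm = shuffler(list(range(n)))
        for pos, elem in enumerate(perm):
            counts[pos][elem] += 1
    # convert counts to empirical probabilities
    return [[c / trials for c in row] for row in counts]

def fisher_yates(x):
    random.shuffle(x)          # random.shuffle performs a Fisher-Yates shuffle
    return x

fy_map = heat_map(fisher_yates)
print(min(min(r) for r in fy_map), max(max(r) for r in fy_map))

Passing the question's shuf function to heat_map gives the corresponding map for the algorithm under discussion, so the two can be compared entry by entry (every entry should hover around 1/20 = 0.05).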
I am looking for the most efficient way to randomly draw n elements from a list, given a list of probabilities stating the probability of each element being picked.
aList = [3,4,2,1,4,3,5,7,6,4]
MyProba = [0.1,0.1,0.2,0,0.1,0,0.2,0,0.2,0.1]
It means that at each draw, the first element (which is 3) has a probability of 0.1 to be drawn. Of course,
sum(MyProba) == 1 # always returns True
len(aList) == len(MyProba) # always returns True
Up to now I did the following:
import random

def random_pick(some_list, proba):
    x = random.uniform(0, 1)
    cumulative_proba = 0.0
    for item, item_proba in zip(some_list, proba):
        cumulative_proba += item_proba
        if x < cumulative_proba:
            break
    return item

nb_draws = 10
list_of_drawn_elements = []
for one_draw in range(nb_draws):
    list_of_drawn_elements.append(random_pick(aList, MyProba))
It works but it is terribly slow for long lists and big values of nb_draws. How can I improve the speed of this process?
Note: In the special case I am facing, nb_draws always equals the length of aList.
The general idea (as outlined by others' answers as well) is that your method is inefficient because the preprocessing (the calculation of the cumulative distribution) is done every time you draw a sample, although it would be enough to do it once before the sampling and then use the preprocessed data to do the sampling.
The preprocessing and sampling can be done efficiently with Walker's alias method. I implemented it a while ago; take a look at the source code. (Sorry for the external link, but I think it's too long to post here.) My version requires NumPy; if you don't want to use NumPy, there is a NumPy-free alternative as well (on which my version is based).
Edit: the explanation of Walker's alias method is to be found in the first link I provided. In a nutshell, imagine that you somehow managed to construct a rectangular "darts board" that is subdivided into parts such that each part corresponds to one of your original items, and the area of each part is proportional to the desired probability of selecting the corresponding element. You can then start throwing darts at random at the darts board (by generating two random numbers that specify the horizontal and vertical coordinate of where the dart ended up) and check which areas the darts hit. The items corresponding to the areas will be the items you have selected. Walker's alias method is simply a linear-time preprocessing that constructs the dart board. Drawing each element can then be done in constant time. In the end, drawing m elements out of n will have a cost of O(n) for preprocessing and O(m) for generating the samples, yielding a total complexity of O(n + m).
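The linked code isn't reproduced here, but a minimal pure-Python sketch of the same idea (Vose's variant of the alias method; build_alias_table and alias_draw are names invented for this sketch) looks roughly like this:

import random

def build_alias_table(probs):
    # O(n) preprocessing: build the "darts board" described above
    n = len(probs)
    scaled = [p * n for p in probs]
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:        # leftovers are full columns
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias):
    # O(1) per sample: pick a column, then flip that column's biased coin
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

prob, alias = build_alias_table(MyProba)
samples = [aList[alias_draw(prob, alias)] for _ in range(len(aList))]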
Here's my lazy method... build a list with the expected number of occurrences of each value for the desired distribution, and use random.choice() to pick a value from that list.
>>> import random
>>>
>>> # note: the dict collapses duplicate values (the repeated 4s and 3s keep only their last probability)
>>> value_probs = dict(zip([3,4,2,1,4,3,5,7,6,4], [0.1,0.1,0.2,0,0.1,0,0.2,0,0.2,0.1]))
>>> expected_dist = sum([[i] * int(prob * 100) for i, prob in value_probs.items()], [])
>>> random.choice(expected_dist)
You might try to precalculate the cumulative probability range for each element and build a tree from these intervals. Then you will get logarithmic complexity for looking up the element corresponding to the generated probability, instead of the linear one you have now.
You're calculating cumulative_proba every time you call random_pick. I suggest calculating it once outside the method and storing it in a better data structure, like a binary search tree, which reduces the lookup from O(n) to O(log n).
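A minimal sketch of that suggestion, using a sorted cumulative table plus bisect rather than an explicit tree (the aList/MyProba names are the question's; pick is a name chosen here):

import bisect
import itertools
import random

# build the cumulative distribution once, outside the sampling loop: O(n)
cumulative = list(itertools.accumulate(MyProba))

def pick(some_list, cumulative):
    # binary search for the interval containing x: O(log n) per draw
    x = random.random()        # relies on the probabilities summing to 1, as in the question
    return some_list[bisect.bisect_right(cumulative, x)]

list_of_drawn_elements = [pick(aList, cumulative) for _ in range(len(aList))]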