Find 10 closest points in decreasing order - python

I am trying to find the distance between a point and other 40,000 points.
Each point is a 300 dimension vector.
I am able to find the closest point. How do I find the 10 nearest points in decreasing order?
Function for closest point:
from scipy.spatial import distance
import pandas as pd

def closest_node(node, df):
    closest_index = distance.cdist([node], df.feature.tolist()).argmin()
    return pd.Series([df.title.tolist()[closest_index], df.id.tolist()[closest_index]])
This command returns the closest title and id:
df3[["closest_title","closest_id"]]=df3.feature.apply(lambda row: closest_node(row,df2))
df2 - pandas dataframe of 40,000 points (each a 300-dimensional vector)
How do I return the title and index for the 10 closest points?
Thanks

Just slice the sorted distance matrix for the top 10 nodes.
Something like this:
from scipy.spatial import distance
import pandas

# Find the query node
query_node = df.iloc[10]  # Not sure what you're looking for

# Find the distance between this node and everyone else
euclidean_distances = df.apply(lambda row: distance.euclidean(row, query_node), axis=1)

# Create a new dataframe with distances.
distance_frame = pandas.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
distance_frame.sort_values("dist", inplace=True)

# The ten nearest nodes (position 0 is the query node itself).
smallest_dist_ixs = distance_frame.iloc[1:11]["idx"]
most_similar_nodes = df.iloc[smallest_dist_ixs.astype(int)]
My assumption, based on the word 'title' you have used here and the choice of 300-dimensional vectors, is that these are word or phrase vectors.
Gensim actually has a way to get the top N most similar words based on this idea, and it is reasonably fast.
https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.most_similar.html
>>> trained_model.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]
For something a bit different: this is also loosely related to the travelling salesman problem (TSP) if you want the shortest path through all points, after which you could simply slice out the first 10 'cities'.
Google has a pretty simple and quick python implementation with OR-Tools here: https://developers.google.com/optimization/routing/tsp.

As I don't know your complete code or have a sample of your data, here is my suggestion:
Instead of using .argmin(), sort your list by distance and return the first ten elements of the sorted list, then find their indices the way you are already doing it.

Related

Travelling Salesperson with scipy.optimize.dual_annealing [duplicate]

How do I solve a Travelling Salesman problem in python? I did not find any library; there should be a way using scipy's optimization functions or other libraries.
My hacky, extremely lazy, pythonic brute-forcing solution is:
from itertools import permutations

tsp_solution = min(
    (sum(Dist[i] for i in zip(per, per[1:])), n, per)
    for n, per in enumerate(permutations(range(Dist.shape[0])))
)[2]
where Dist (numpy.array) is the distance matrix.
If Dist is too big this will take forever.
Suggestions?
The scipy.optimize functions are not constructed to allow straightforward adaptation to the traveling salesman problem (TSP). For a simple solution, I recommend the 2-opt algorithm, which is a well-accepted algorithm for solving the TSP and relatively straightforward to implement. Here is my implementation of the algorithm:
import numpy as np

# Calculate the Euclidean distance in n-space of the route r traversing cities c, ending at the path start.
path_distance = lambda r,c: np.sum([np.linalg.norm(c[r[p]]-c[r[p-1]]) for p in range(len(r))])
# Reverse the order of all elements from element i to element k in array r.
two_opt_swap = lambda r,i,k: np.concatenate((r[0:i],r[k:-len(r)+i-1:-1],r[k+1:len(r)]))

def two_opt(cities,improvement_threshold): # 2-opt Algorithm adapted from https://en.wikipedia.org/wiki/2-opt
    route = np.arange(cities.shape[0]) # Make an array of row numbers corresponding to cities.
    improvement_factor = 1 # Initialize the improvement factor.
    best_distance = path_distance(route,cities) # Calculate the distance of the initial path.
    while improvement_factor > improvement_threshold: # If the route is still improving, keep going!
        distance_to_beat = best_distance # Record the distance at the beginning of the loop.
        for swap_first in range(1,len(route)-2): # From each city except the first and last,
            for swap_last in range(swap_first+1,len(route)): # to each of the cities following,
                new_route = two_opt_swap(route,swap_first,swap_last) # try reversing the order of these cities
                new_distance = path_distance(new_route,cities) # and check the total distance with this modification.
                if new_distance < best_distance: # If the path distance is an improvement,
                    route = new_route # make this the accepted best route
                    best_distance = new_distance # and update the distance corresponding to this route.
        improvement_factor = 1 - best_distance/distance_to_beat # Calculate how much the route has improved.
    return route # When the route is no longer improving substantially, stop searching and return the route.
Here is an example of the function being used:
# Create a matrix of cities, with each row being a location in 2-space (function works in n-dimensions).
cities = np.random.RandomState(42).rand(70,2)
# Find a good route with 2-opt ("route" gives the order in which to travel to each city by row number.)
route = two_opt(cities,0.001)
And here is the approximated solution path shown on a plot:
import matplotlib.pyplot as plt
# Reorder the cities matrix by route order in a new matrix for plotting.
new_cities_order = np.concatenate((np.array([cities[route[i]] for i in range(len(route))]),np.array([cities[0]])))
# Plot the cities.
plt.scatter(cities[:,0],cities[:,1])
# Plot the path.
plt.plot(new_cities_order[:,0],new_cities_order[:,1])
plt.show()
# Print the route as row numbers and the total distance travelled by the path.
print("Route: " + str(route) + "\n\nDistance: " + str(path_distance(route,cities)))
If the speed of algorithm is important to you, I recommend pre-calculating the distances and storing them in a matrix. This dramatically decreases the convergence time.
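As a rough sketch of that pre-calculation (the dist_matrix and path_distance_matrix names are my own; it assumes the cities, route and np objects from the example above):
from scipy.spatial import distance_matrix

# Pre-calculate all pairwise city distances once.
dist_matrix = distance_matrix(cities, cities)

# Path length looked up from the matrix instead of recomputing norms every time.
path_distance_matrix = lambda r, d: np.sum([d[r[p], r[p - 1]] for p in range(len(r))])

print(path_distance_matrix(route, dist_matrix))
Inside two_opt you could then call path_distance_matrix(route, dist_matrix) in place of path_distance(route, cities).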
Edit: Custom Start and End Points
For a non-circular path (one which ends at a location different from where it starts), edit the path distance formula to
path_distance = lambda r,c: np.sum([np.linalg.norm(c[r[p+1]]-c[r[p]]) for p in range(len(r)-1)])
and then reorder the cities for plotting using
new_cities_order = np.array([cities[route[i]] for i in range(len(route))])
With the code as it is, the starting city is fixed as the first city in cities, and the ending city is variable.
To make the ending city the last city in cities, restrict the range of swappable cities by changing the range of swap_first and swap_last in two_opt() with the code
for swap_first in range(1,len(route)-3):
    for swap_last in range(swap_first+1,len(route)-1):
To make both the starting and ending cities variable, instead expand the range of swap_first and swap_last with
for swap_first in range(0,len(route)-2):
    for swap_last in range(swap_first+1,len(route)):
I recently found this option that uses linear optimization for the TSP:
https://gist.github.com/mirrornerror/a684b4d439edbd7117db66a56f2483e0
Nonetheless I agree with some of the other comments; this is just a reminder that there are ways to use linear optimization for this problem.
Some academic publications include the following:
http://www.opl.ufc.br/post/tsp/
https://phabi.ch/2021/09/19/tsp-subtour-elimination-by-miller-tucker-zemlin-constraint/
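As a hedged illustration of the idea in those links, here is my own sketch of a Miller-Tucker-Zemlin (MTZ) formulation. It uses the pulp package, which is not mentioned in the linked posts, and it only scales to small instances:
import numpy as np
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

def tsp_mtz(dist):
    n = len(dist)
    prob = LpProblem("tsp", LpMinimize)
    # x[i][j] == 1 if the tour goes directly from city i to city j
    x = LpVariable.dicts("x", (range(n), range(n)), cat=LpBinary)
    # u[i]: position of city i in the tour (MTZ ordering variables)
    u = LpVariable.dicts("u", range(n), lowBound=1, upBound=n - 1)
    # minimise the total tour length
    prob += lpSum(dist[i][j] * x[i][j] for i in range(n) for j in range(n) if i != j)
    for i in range(n):
        prob += lpSum(x[i][j] for j in range(n) if j != i) == 1  # leave each city once
        prob += lpSum(x[j][i] for j in range(n) if j != i) == 1  # enter each city once
    # Miller-Tucker-Zemlin subtour-elimination constraints
    for i in range(1, n):
        for j in range(1, n):
            if i != j:
                prob += u[i] - u[j] + n * x[i][j] <= n - 1
    prob.solve()
    # walk the chosen edges starting from city 0 to recover the tour
    tour, current = [0], 0
    while len(tour) < n:
        current = next(j for j in range(n) if j != current and x[current][j].value() > 0.5)
        tour.append(current)
    return tour

dist = np.random.RandomState(0).rand(8, 8)
print(tsp_mtz(dist))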

Does anyone know a more efficient way to run a pairwise comparison of hundreds of trajectories?

So I have two different files containing multiple trajectories in a squared map (512x512 pixels). Each file contains information about the spatial position of each particle within a track/trajectory (X and Y coordinates) and about which track/trajectory that spot belongs to (TRACK_ID).
My goal was to find a way to cluster similar trajectories between both files. I found a nice way to do this (distance clustering comparison), but the code is too slow. I was just wondering if someone has suggestions to make it faster.
My files look something like this:
The approach that I implemented finds similar trajectories based on something called the Fréchet distance (maybe not too relevant here). Below you can find the function that I wrote, but briefly this is the rationale:
group all the spots by track using the pandas.groupby function for file1 (growth_xml) and file2 (shrinkage_xml)
for each trajectory in growth_xml (outer loop) I compare it with each trajectory in shrinkage_xml
if they pass the Fréchet distance criterion that I defined (an if statement) I save both tracks in a new table. You can see an additional filter condition that I called delay, but I guess that is not important to explain here.
so really simple:
def distance_clustering(growth_xml, shrinkage_xml):
    coords_g = pd.DataFrame()  # empty dataframes to save filtered tracks
    coords_s = pd.DataFrame()
    counter = 0  # initialize counter to count number of filtered tracks
    for track_g, param_g in growth_xml.groupby('TRACK_ID'):
        # define growing track as multi-point line object
        traj1 = [(x, y) for x, y in zip(param_g.POSITION_X.values, param_g.POSITION_Y.values)]
        for track_s, param_s in shrinkage_xml.groupby('TRACK_ID'):
            # define shrinking track as a second multi-point line object
            traj2 = [(x, y) for x, y in zip(param_s.POSITION_X.values, param_s.POSITION_Y.values)]
            # compute delay between shrinkage and growing ends to use as an extra filter
            delay = param_s.FRAME.iloc[0] - param_g.FRAME.iloc[0]
            # keep the pair only if the Frechet distance is lower than 0.2 microns
            if frechetDist(traj1, traj2) < 0.2 and delay > 0:
                counter += 1
                param_g = param_g.assign(NEW_ID=np.ones(param_g.shape[0]) * counter)
                coords_g = pd.concat([coords_g, param_g])
                param_s = param_s.assign(NEW_ID=np.ones(param_s.shape[0]) * counter)
                coords_s = pd.concat([coords_s, param_s])
    coords_g.reset_index(drop=True, inplace=True)
    coords_s.reset_index(drop=True, inplace=True)
    return coords_g, coords_s
The main problem is that most of the time I have more than two thousand tracks (!!) and this pairwise combination takes forever. I'm wondering if there's a simple and more efficient way to do this. Perhaps by doing the pairwise comparison in multiple small areas instead of the whole map? Not sure...
Have you tried building a (DeltaX, DeltaY) lookup table (LUT) for the pairwise distances? It will take a while to calculate the LUT once, or you can write it to a file and load it when the algorithm starts.
Then you only have to look up the correct entry to get the result instead of calculating it each time.
You could also fit a polynomial regression for the distance calculation; it will be less precise but definitely faster.
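A rough sketch of that lookup-table idea, assuming the coordinates can be rounded to integer pixel offsets on the 512x512 map (the lut and fast_dist names are my own):
import numpy as np

# Pre-compute the Euclidean length of every possible integer (|dx|, |dy|) offset once.
dx, dy = np.meshgrid(np.arange(512), np.arange(512), indexing="ij")
lut = np.hypot(dx, dy)

def fast_dist(p1, p2):
    # Look the distance up instead of recomputing the square root every time.
    return lut[abs(int(round(p1[0] - p2[0]))), abs(int(round(p1[1] - p2[1])))]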
Maybe not an outright answer, but it's been a while: could you not segment the lines and use a minimum bounding box around each segment to assess similarities? I might be thinking of your problem the wrong way around, I'm not sure. Right now I'm trying to work with polygons from two different data sets and want to optimize the processing by first identifying the polygons in both geometries that overlap.
In your case, I think segments would leave you with some edge artifacts. Maybe look at this paper: https://drops.dagstuhl.de/opus/volltexte/2021/14879/pdf/OASIcs-ATMOS-2021-10.pdf or this paper (with python code): https://www.austriaca.at/0xc1aa5576_0x003aba2b.pdf
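If you want to try the bounding-box idea as a cheap pre-filter before the expensive Fréchet computation, a hedged sketch could look like this (the 0.2 cutoff reuses the threshold from your code; the helper names are my own):
import math

def bbox(track):
    # axis-aligned bounding box of a list of (x, y) points: (xmin, ymin, xmax, ymax)
    xs, ys = zip(*track)
    return min(xs), min(ys), max(xs), max(ys)

def bbox_gap(b1, b2):
    # smallest possible distance between any point of box b1 and any point of box b2
    dx = max(b1[0] - b2[2], b2[0] - b1[2], 0)
    dy = max(b1[1] - b2[3], b2[1] - b1[3], 0)
    return math.hypot(dx, dy)

# inside the double loop, before calling frechetDist:
# if bbox_gap(bbox(traj1), bbox(traj2)) >= 0.2:
#     continue  # the Frechet distance cannot be below 0.2 for this pair
Since the Fréchet distance between two curves is never smaller than the gap between their bounding boxes, any pair rejected this way could not have passed the 0.2 test anyway.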

fast comparison of a large number of lists of lists

Comparing lists of lists has been posted about before, but the Python environment that I am working in cannot fully integrate all the methods and classes in numpy. I cannot import pandas either.
I am trying to compare lists within a big list and come up with roughly 8-10 lists that approximate all the other lists in the big list.
The approach I have works fine if I have <50 lists in the big list. However, I am trying to compare at least 20k lists and ideally 1 million+. I am currently looking into itertools. What might be the fastest, most efficient approach for large data sets without using numpy or pandas?
I am able to use some of the methods and classes in numpy but not all. For example, numpy.allclose and numpy.all do not work properly and that is because of the environment that I am working in.
rel_tol = .1
avg_lists = []
cntr = 0  # count the number of filtered lists

# compare the lists in the big list and output ~8-10 lists that approximate all the lists in the big list
for j in range(len(big_list)):
    for k in range(len(big_list)):
        array1 = np.array(big_list[j])
        array2 = np.array(big_list[k])
        if j != k:
            diff = np.subtract(array1, array2)
            abs_diff = np.absolute(diff)
            # cannot use numpy.allclose
            # if the deviation for the largest value in the array is < 10%
            if np.amax(abs_diff) <= rel_tol and big_list[k] not in avg_lists:
                cntr += 1
                avg_lists.append(big_list[k])
Fundamentally, it looks like what you're aiming at is a clustering operation (i.e. representing a set of N points via K < N cluster centers). I would suggest a K-Means clustering approach, where you increase K until the size of your clusters is below your desired threshold.
I'm not sure what you mean by "cannot fully integrate all the methods and classes in numpy", but if scikit-learn is available you could use its K-means estimator. If that's not possible, a simple version of the K-means algorithm is relatively easy to code from scratch, and you might use that.
Here's a k-means approach using scikit-learn:
# 100 lists of length 10 = 100 points in 10 dimensions
from random import random
big_list = [[random() for i in range(10)] for j in range(100)]
# compute eight representative points
from sklearn.cluster import KMeans
model = KMeans(n_clusters=8)
model.fit(big_list)
centers = model.cluster_centers_
print(centers.shape) # (8, 10)
# this is the sum of square distances of your points to the cluster centers
# you can adjust n_clusters until this is small enough for your purposes.
sum_sq_dists = model.inertia_
From here you can e.g. find the closest point in each cluster to its center and treat this as the average. Without more detail of the problem you're trying to solve, it's hard to say for sure. But a clustering approach like this will be the most efficient way to solve a problem like the one you stated in your question.
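If even scikit-learn is unavailable in your environment, a minimal pure-Python K-means sketch (random initialization, Euclidean distance, no numpy required; the kmeans name is my own) might look like this:
import random

def kmeans(points, k, iters=100, seed=0):
    # points: list of equal-length lists of floats; returns k cluster centers
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # assign every point to its nearest center (squared Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # recompute each center as the mean of its cluster
        new_centers = []
        for c in range(k):
            if clusters[c]:
                dim = len(clusters[c][0])
                new_centers.append([sum(p[d] for p in clusters[c]) / len(clusters[c]) for d in range(dim)])
            else:
                new_centers.append(centers[c])  # keep a center that lost all its points
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers

# e.g. representative_lists = kmeans(big_list, 8)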

Python: Single linkage clustering algorithm

I am new to Python and I am looking for an example of a naive, simple single linkage clustering python algorithm that is based on creating a proximity matrix and removing nodes from that. I know that there are packages such as numpy but I would rather avoid them.
I have searched online but couldn't find any code simple enough to be able to understand in order to replicate it myself afterwards.
1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
2. Find the most similar pair of clusters in the current clustering, say pair (r), (s), according to d[(r),(s)] = min d[(i),(j)], where the minimum is over all pairs of clusters in the current clustering.
3. Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form the next clustering m. Set the level of this clustering to L(m) = d[(r),(s)].
4. Update the proximity matrix, D, by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster. The proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined as d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)]).
5. If all objects are in one cluster, stop. Else, go to step 2.
These are the steps as described on Wikipedia. I have created the distance matrix but am not sure how to proceed from there.
This is what I have so far:
comparing
def comparison(protein1, protein2):
    l = [i for i in range(len(protein1)) if protein1[i] != protein2[i]]
    return len(l)
creating the matrix
def matrix(proteins):
    r = []
    for p1 in proteins:
        row = []
        for p2 in proteins:
            row += [comparison(p1, p2)]
        r += [row]
    return r
These are the sequences I am trying to compare:
seqlist = { "Human": "MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHG", "Chimpanzee": "MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHG", "Western tarsier":"MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGXNLHG", "Mouse": "MGDAEAGKKIFVQKCAQCHTVEKGGKHKTGPNLWG", "Rabbit": "MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHG", "Dog": "MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHG", "Pig": "MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHG", "Snapping turtle":"MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLNG", "Alligator": "MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHG", "Honeybee": "AGDPEKGKKIFVQKCAQCHTIESGGKHKVGPNLYG", }
You should look at the package scipy, which has several hierarchical clustering algorithms implemented (see scipy.cluster.hierarchy). Look for the function pdist in the scipy.spatial.distance module.
You should be able to get lots of nice usage examples from there.
See http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html
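For instance, a small sketch that feeds the comparison() function from the question into scipy's single-linkage routine (the threshold of 2 passed to fcluster is just an arbitrary example):
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

names = list(seqlist)
# full proximity matrix built with the comparison() function from the question
full = [[comparison(seqlist[a], seqlist[b]) for b in names] for a in names]
# linkage() expects the condensed (upper-triangular) form of the matrix
Z = linkage(squareform(full), method='single')
# flat cluster labels, cutting the dendrogram at distance 2
print(dict(zip(names, fcluster(Z, t=2, criterion='distance'))))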

sort list of floating-point numbers in groups

I have an array of floating-point numbers, which is unordered. I know that the values always fall around a few points, which are not known. For illustration, this list
[10.01,5.001,4.89,5.1,9.9,10.1,5.05,4.99]
has values clustered around 5 and 10, so I would like [5,10] as answer.
I would like to find those clusters for lists with 1000+ values, where the number of clusters is probably around 10 (for some given tolerance). How can I do that efficiently?
Check python-cluster. With this library you could do something like this :
from cluster import *
data = [10.01,5.001,4.89,5.1,9.9,10.1,5.05,4.99]
cl = HierarchicalClustering(data, lambda x,y: abs(x-y))
print([mean(cluster) for cluster in cl.getlevel(1.0)])
And you would get:
[5.0062, 10.003333333333332]
(This is a very silly example, because I don't really know what you want to do, and because this is the first time I've used this library)
You can try the following method:
Sort the array first, then use diff() to calculate the difference between consecutive values. Differences larger than a threshold can be considered split positions:
import numpy as np
x = [10.01,5.001,4.89,5.1,9.9,10.1,5.05,4.99]
x = np.sort(x)
th = 0.5
print([group.mean() for group in np.split(x, np.where(np.diff(x) > th)[0] + 1)])
the result is:
[5.0061999999999998, 10.003333333333332]
