How to find closest elements in two array? - python

I have two numpy arrays, like X=[x1,x2,x3,x4], y=[y1,y2,y3,y4]. Three of the elements are close and the fourth of them maybe close or not.
Like:
X [ 84.04467948 52.42447842 39.13555678 21.99846595]
y [ 78.86529444 52.42447842 38.74910101 21.99846595]
Or it can be:
X [ 84.04467948 60 52.42447842 39.13555678]
y [ 78.86529444 52.42447842 38.74910101 21.99846595]
I want to define a function to find the the corresponding index in the two arrays, like in first case:
y[0] correspond to X[0],
y[1] correspond to X[1],
y[2] correspond to X[2],
y[3] correspond to X[3]
And in second case:
y[0] correspond to X[0],
y[1] correspond to X[2],
y[2] correspond to X[3]
and y[3] correspond to X[1].
I can't write a function to solve the problem completely, please help.

You can start by precomputing the distance matrix as show in this answer:
import numpy as np
X = np.array([84.04467948,60.,52.42447842,39.13555678])
Y = np.array([78.86529444,52.42447842,38.74910101,21.99846595])
dist = np.abs(X[:, np.newaxis] - Y)
Now you can compute the minimums along one axis (I chose 1 corresponding to finding the closest element of Y for every X):
potentialClosest = dist.argmin(axis=1)
This still may contain duplicates (in your case 2). To check for that, you can find find all Y indices that appear in potentialClosest by use of np.unique:
closestFound, closestCounts = np.unique(potentialClosest, return_counts=True)
Now you can check for duplicates by checking if closestFound.shape[0] == X.shape[0]. If so, you're golden and potentialClosest will contain your partners for every element in X. In your case 2 though, one element will occur twice and therefore closestFound will only have X.shape[0]-1 elements whereas closestCounts will not contain only 1s but one 2. For all elements with count 1 the partner is already found. For the two candidates with count 2, though you will have to choose the closer one while the partner of the one with the larger distance will be the one element of Y which is not in closestFound. This can be found as:
missingPartnerIndex = np.where(
np.in1d(np.arange(Y.shape[0]), closestFound)==False
)[0][0]
You can do the matchin in a loop (even though there might be some nicer way using numpy). This solution is rather ugly but works. Any suggestions for improvements are very appreciated:
partners = np.empty_like(X, dtype=int)
nonClosePartnerFound = False
for i in np.arange(X.shape[0]):
if closestCounts[closestFound==potentialClosest[i]][0]==1:
# A unique partner was found
partners[i] = potentialClosest[i]
else:
# Partner is not unique
if nonClosePartnerFound:
partners[i] = potentialClosest[i]
else:
if np.argmin(dist[:, potentialClosest[i]]) == i:
partners[i] = potentialClosest[i]
else:
partners[i] = missingPartnerIndex
nonClosePartnerFound = True
print(partners)
This answer will only work if only one pair is not close. If that is not the case, you will have to define how to find the correct partner for multiple non-close elements. Sadly it's neither a very generic nor a very nice solution, but hopefully you will find it a helpful starting point.

Using this answer https://stackoverflow.com/a/8929827/3627387 and https://stackoverflow.com/a/12141207/3627387
FIXED
def find_closest(alist, target):
return min(alist, key=lambda x:abs(x-target))
X = [ 84.04467948, 52.42447842, 39.13555678, 21.99846595]
Y = [ 78.86529444, 52.42447842, 38.74910101, 21.99846595]
def list_matching(list1, list2):
list1_copy = list1[:]
pairs = []
for i, e in enumerate(list2):
elem = find_closest(list1_copy, e)
pairs.append([i, list1.index(elem)])
list1_copy.remove(elem)
return pairs

Seems like best approach would be to pre-sort both array (nlog(n)) and then perform merge-like traverse through both arrays. It's definitely faster than nn which you indicated in comment.

The below simply prints the corresponding indexes of the two arrays as you have done in your question as I'm not sure what output you want your function to give.
X1 = [84.04467948, 52.42447842, 39.13555678, 21.99846595]
Y1 = [78.86529444, 52.42447842, 38.74910101, 21.99846595]
X2 = [84.04467948, 60, 52.42447842, 39.13555678]
Y2 = [78.86529444, 52.42447842, 38.74910101, 21.99846595]
def find_closest(x_array, y_array):
# Copy x_array as we will later remove an item with each iteration and
# require the original later
remaining_x_array = x_array[:]
for y in y_array:
differences = []
for x in remaining_x_array:
differences.append(abs(y - x))
# min_index_remaining is the index position of the closest x value
# to the given y in remaining_x_array
min_index_remaining = differences.index(min(differences))
# related_x is the closest x value of the given y
related_x = remaining_x_array[min_index_remaining]
print 'Y[%s] corresponds to X[%s]' % (y_array.index(y), x_array.index(related_x))
# Remove the corresponding x value in remaining_x_array so it
# cannot be selected twice
remaining_x_array.pop(min_index_remaining)
This then outputs the following
find_closest(X1,Y1)
Y[0] corresponds to X[0]
Y[1] corresponds to X[1]
Y[2] corresponds to X[2]
Y[3] corresponds to X[3]
and
find_closest(X2,Y2)
Y[0] corresponds to X[0]
Y[1] corresponds to X[2]
Y[2] corresponds to X[3]
Y[3] corresponds to X[1]
Hope this helps.

Related

Iterate through ranges and return those not in any range?

I have a list of floats.
values = [2.3, 6.4, 11.3]
What I want to do is find a range from each value in the list of size delta = 2, then iterate through another range of floats and compare each float to each range, then return the floats that do not fall in any ranges.
What I have so far is,
not_in_range =[]
for x in values:
pre = float(x - delta)
post = float(x + delta)
for y in numpy.arange(0,15,0.5):
if (pre <= y <= post) == True:
pass
else:
not_in_range.append(y)
But obviously, this does not work for several reasons: redundancy, does not check all ranges at once, etc. I am new to coding and I am struggling to think abstractly enough to solve this problem. Any help in formulating a plan of action would be greatly appreciated.
EDIT
For clarity, what I want is a list of ranges from each value (or maybe a numpy array?) as
[0.3, 4.3]
[4.4, 8.4]
[9.3, 13.3]
And to return any float from 0 - 15 in increments of 0.5 that do not fall in any of those ranges, so the final output would be:
not_in_ranges = [0, 8.5, 9, 13.5, 14, 14.5]
To generate the list of ranges, you could do a quick list comprehension:
ranges = [[x-2, x+2] for x in values]
## [[0.3, 4.3], [4.4, 8.4], [9.3, 13.3]]
Then, to return any float from 0 to 15 (in increments of 0.5) that don't fall in any of the ranges, you can use:
not_in_ranges = []
for y in numpy.arange(0, 15, 0.5): # for all desired values to check
if not any(pre < y and y < post for pre, post in ranges):
not_in_ranges.append(y) # if it is in none of the intervals, append it
## [0.0, 8.5, 9.0, 13.5, 14.0, 14.5]
Explanation: This loops through each of the possible values and appends it to the not_in_ranges list if it is not in any of the intervals. To check if it is in the intervals, I use the builtin python function any to check if there are any pre and post values in the list ranges that return True when pre < y < post (i.e. if y is in any of the intervals). If this is False, then it doesn't fit into any of the intervals and so is added to the list of such values.
Alternatively, if you only need the result (and not both of the lists), you can combine the two with something like:
not_in_ranges = []
for y in numpy.arange(0, 15, 0.5):
if not any(x-2 < y and y < x+2 for x in values):
not_in_ranges.append(y)
You could even use list comprehension again, giving the very pythonic looking:
not_in_ranges = [y for y in numpy.arange(0, 15, 0.5) if not any(x-2 < y and y < x+2 for x in values)]
Note that the last one is likely the fastest to run since the append call is quite slow and list comprehension is almost always faster. Though it certainly might not be the easiest to understand at a glance if you aren't already used to python list comprehension format.
I have done the comparative analysis (in jupyter notebook). Look the results.
# First cell
import numpy as np
values = np.random.randn(1000000)
values.shape
# Second cell
%%time
not_in_range =[]
for x in values:
pre = float(x - 2)
post = float(x + 2)
for y in np.arange(0,15,0.5):
if (pre <= y <= post) == True:
pass
else:
not_in_range.append(y)
# Second cell output - Wall time: 37.2 s
# Third cell
%%time
pre = values - 2
post = values + 2
whole_range = np.arange(0,15,0.5)
whole_range
search_range = []
for pr, po in zip(pre, post):
pr = (int(pr) + 0.5) if (pr%5) else int(pr)
po = (int(po) + 0.5) if (po%5) else int(po)
search_range += list(np.arange(pr, po, 0.5))
whole_range = set(whole_range)
search_range = set(search_range)
print(whole_range.difference(search_range))
# Third cell output - Wall time: 3.99 s
You can use the interval library intvalpy
from intvalpy import Interval
import numpy as np
values = [2.3, 6.4, 11.3]
delta = 2
intervals = values + Interval(-delta, delta)
not_in_ranges = []
for k in np.arange(0, 15, 0.5):
if not k in intervals:
not_in_ranges.append(k)
print(not_in_ranges)
Intervals are created according to the constructive definitions of interval arithmetic operations.
The in operator checks whether a point (or an interval) is contained within another interval.

How to check if 2 different values are from the same list and obtaining the list name

** I modified the entire question **
I have an example list specified below and i want to find if 2 values are from the same list and i wanna know which list both the value comes from.
list1 = ['a','b','c','d','e']
list2 = ['f','g','h','i','j']
c = 'b'
d = 'e'
i used for loop to check whether the values exist in the list however not sure how to obtain which list the value actually is from.
for x,y in zip(list1,list2):
if c and d in x or y:
print(True)
Please advise if there is any work around.
First u might want to inspect the distribution of values and sizes where you can improve the result with the least effort like this:
df_inspect = df.copy()
df_inspect["size.value"] = ["size.value"].map(lambda x: ''.join(y.upper() for y in x if x.isalpha() if y != ' '))
df_inspect = df_inspect.groupby(["size.value"]).count().sort_values(ascending=False)
Then create a solution for the most occuring size category, here "Wide"
long = "adasda, 9.5 W US"
short = "9.5 Wide"
def get_intersection(s1, s2):
res = ''
l_s1 = len(s1)
for i in range(l_s1):
for j in range(i + 1, l_s1):
t = s1[i:j]
if t in s2 and len(t) > len(res):
res = t
return res
print(len(get_intersection(long, short)) / len(short) >= 0.6)
Then apply the solution to the dataframe
df["defective_attributes"] = df.apply(lambda x: len(get_intersection(x["item_name.value"], x["size.value"])) / len(x["size.value"]) >= 0.6)
Basically, get_intersection search for the longest intersection between the itemname and the size. Then takes the length of the intersection and says, its not defective if at least 60% of the size_value are also in the item_name.

Corresponding values of a parameter in Python

I am working in SageMath (Python-based), I am quite new to programming and I have the following question. In my computations I have a quadratic form: x^TAx = b , where the matrix A is defined already as a symmetric matrix, and x is defined as
import itertools
X = itertools.product([0,1], repeat = n)
for x in X:
x = vector(x)
print x
as all combination of [0,1] repeated n times. I got a set of values for b the following way:
import itertools
X = itertools.product([0,1], repeat = n)
results = []
for x in X:
x = vector(x)
x = x.row()
v = x.transpose()
b = x * A * v
results.append(b[0, 0])
And then I defined:
U = set(results)
U1 = sorted(U)
A = []
for i in U1:
U2 = round(i, 2)
A.append(U2)
So I have a sorted set to get a few minimal values of my results. I need to extract minimal values from the set and identify what particular x is corresponding to each value of b. I heard that I can use dictionary method and define preimages in there, but I am really struggling to define my dictionary as {key: value}. Could someone help me please, solve the problem or give me the idea of what direction should I think in? Thank you.

If I have two lines, how do I find which line is on top after an intersection?

I have two lists of points for the graph so I can plot two lines. I'm using matplotlib to find the intersections of the two lines. However, these points are not necessarily in the original list of points. What I'm trying to do is to find which is line on top and which line is on the bottom immediately after the intersection. I was thinking about using the derivative at that point what I am not sure how to do that or if there an is easier way.
line1 = [30,32,40,50,31,20]
line2 = [29,37,23,30,51,32]
l1 = LineString(line1)
l2 = LineString(line2)
intersection = l1.intersection(l2)
intersect_points = [list(p.coords)[0] for p in intersection]
intersect_points would give a list of the intersections between the two lists.We can see from the points that list1 and list2 intersects from 30-32 and 29-37. The actual point of intersection would be given in intersect_points. I want to know which one is on top. In this case, list 2 would be on top after the intersection.
If I understand the problem correctly, I think the easiest thing would be to have the intersection method to also return a value to flag which one of the two lines has a higher value for each point of intersection. That way no indexing would be necessary.
Note: this solutions ignores numpy
Here is a sample code (unoptimized and optimistic):
class Line:
def __init__(self, seq):
self.pts = list(seq)
def intersection(self, line):
""" returns the list of intersections.
each element is a tuple formed by
flag True if the second line is on top
x the position of the intersection along the x axis
y the value of the intersection along the y axis"""
ret = []
pos = 0
z = iter(zip(self.pts, line.pts))
p, pl = next(z)
for c, cl in z:
denom = c - p - cl + pl
if denom:
numer = pl - p
if numer:
x = float(numer) / float(denom)
y = (c - p)*x + p
else:
x = 0.0
y = float(p)
ret.append((cl > c, float(pos) + x, y))
p, pl = c, cl
pos += 1
return ret
Let's try it out:
>>> from intersection import *
>>> a=Line([2,7,4])
>>> b=Line([5,3,6])
>>> a.intersection(b)
[(False, 0.42857142857142855, 4.142857142857142),
(True, 1.6666666666666665, 5.0)]
As you can see, on the first intersection it shows the a line is on top, whereas on the second one line b is on top.
I hope it helps.
:)

Finding index of an item closest to the value in a list that's not entirely sorted

As an example my list is:
[25.75443, 26.7803, 25.79099, 24.17642, 24.3526, 22.79056, 20.84866,
19.49222, 18.38086, 18.0358, 16.57819, 15.71255, 14.79059, 13.64154,
13.09409, 12.18347, 11.33447, 10.32184, 9.544922, 8.813385, 8.181152,
6.983734, 6.048035, 5.505096, 4.65799]
and I'm looking for the index of the value closest to 11.5. I've tried other methods such as binary search and bisect_left but they don't work.
I cannot sort this array, because the index of the value will be used on a similar array to fetch the value at that index.
Try the following:
min(range(len(a)), key=lambda i: abs(a[i]-11.5))
For example:
>>> a = [25.75443, 26.7803, 25.79099, 24.17642, 24.3526, 22.79056, 20.84866, 19.49222, 18.38086, 18.0358, 16.57819, 15.71255, 14.79059, 13.64154, 13.09409, 12.18347, 11.33447, 10.32184, 9.544922, 8.813385, 8.181152, 6.983734, 6.048035, 5.505096, 4.65799]
>>> min(range(len(a)), key=lambda i: abs(a[i]-11.5))
16
Or to get the index and the value:
>>> min(enumerate(a), key=lambda x: abs(x[1]-11.5))
(16, 11.33447)
import numpy as np
a = [25.75443, 26.7803, 25.79099, 24.17642, 24.3526, 22.79056, 20.84866, 19.49222, 18.38086, 18.0358, 16.57819, 15.71255, 14.79059, 13.64154, 13.09409, 12.18347, 11.33447, 10.32184, 9.544922, 8.813385, 8.181152, 6.983734, 6.048035, 5.505096, 4.65799]
index = np.argmin(np.abs(np.array(a)-11.5))
a[index] # here is your result
In case a is already an array, the corresponding transformation can be ommitted.
How about: you zip the two lists, then sort the result?
If you can't sort the array, then there is no quick way to find the closest item - you have to iterate over all entries.
There is a workaround but it's quite a bit of work: Write a sort algorithm which sorts the array and (at the same time) updates a second array which tells you where this entry was before the array was sorted.
That way, you can use binary search to look up index of the closest entry and then use this index to look up the original index using the "index array".
[EDIT] Using zip(), this is pretty simple to achieve:
array_to_sort = zip( original_array, range(len(original_array)) )
array_to_sort.sort( key=i:i[0] )
Now you can binary search for the value (using item[0]). item[1] will give you the original index.
Going through all the items is only linear. If you would sort the array that would be worse.
I don't see a problem on keeping an additional deltax (the min difference so far) and idx (the index of that element) and just loop once trough the list.
Keep in mind that if space isn't important you can sort any list without moving the contents by creating a secondary list of the sorted indices.
Also bear in mind that if you are doing this look up just once, then you will just have to traverse every element in the list O(n). (If multiple times then you probably would want to sort for increase efficiency later)
If you are searching a long list a lot of times, then min scales very bad (O(n^2), if you append some of your searches to the search list, I think).
Bisect is your friend. Here's my solution. It scales O(n*log(n)):
class Closest:
"""Assumes *no* redundant entries - all inputs must be unique"""
def __init__(self, numlist=None, firstdistance=0):
if numlist == None:
numlist=[]
self.numindexes = dict((val, n) for n, val in enumerate(numlist))
self.nums = sorted(self.numindexes)
self.firstdistance = firstdistance
def append(self, num):
if num in self.numindexes:
raise ValueError("Cannot append '%s' it is already used" % str(num))
self.numindexes[num] = len(self.nums)
bisect.insort(self.nums, num)
def rank(self, target):
rank = bisect.bisect(self.nums, target)
if rank == 0:
pass
elif len(self.nums) == rank:
rank -= 1
else:
dist1 = target - self.nums[rank - 1]
dist2 = self.nums[rank] - target
if dist1 < dist2:
rank -= 1
return rank
def closest(self, target):
try:
return self.numindexes[self.nums[self.rank(target)]]
except IndexError:
return 0
def distance(self, target):
rank = self.rank(target)
try:
dist = abs(self.nums[rank] - target)
except IndexError:
dist = self.firstdistance
return dist
Use it like this:
a = [25.75443, 26.7803, 25.79099, 24.17642, 24.3526, 22.79056, 20.84866,
19.49222, 18.38086, 18.0358, 16.57819, 15.71255, 14.79059, 13.64154,
13.09409, 12.18347, 1.33447, 10.32184, 9.544922, 8.813385, 8.181152,
6.983734, 6.048035, 5.505096, 4.65799]
targets = [1.0, 100.0, 15.0, 15.6, 8.0]
cl = Closest(a)
for x in targets:
rank = cl.rank(x)
print("Closest to %5.1f : rank=%2i num=%8.5f index=%2i " % (x, rank,
cl.nums[rank], cl.closest(x)))
Will output:
Closest to 1.0 : rank= 0 num= 1.33447 index=16
Closest to 100.0 : rank=25 num=26.78030 index= 1
Closest to 15.0 : rank=12 num=14.79059 index=12
Closest to 15.6 : rank=13 num=15.71255 index=11
Closest to 8.0 : rank= 5 num= 8.18115 index=20
And:
cl.append(99.9)
x = 100.0
rank = cl.rank(x)
print("Closest to %5.1f : rank=%2i num=%8.5f index=%2i " % (x, rank,
cl.nums[rank], cl.closest(x)))
Output:
Closest to 100.0 : rank=25 num=99.90000 index=25

Categories