Removing 2D points that are close to each other - python

I would like to remove coordinates that lie close to each other, or that are exact duplicates.
For example,
x = [[9, 169], [5, 164],[340,210],[1020,102],[210,312],[12,150]]
In the above list, the first and second elements lie close to each other. How do I remove the second element while preserving the first one?
Following is what I have tried,
def process(input_list, thresh=(10, 10)):
    buffer = input_list.copy()
    n = 0
    prev_cx, prev_cy = 0, 0
    for i in range(len(input_list)):
        elem = input_list[i]
        cx, cy = elem
        if n == 0:
            prev_cx, prev_cy = cx, cy
        else:
            ab_cx, ab_cy = abs(prev_cx - cx), abs(prev_cy - cy)
            if ab_cx <= thresh[0] and ab_cy <= thresh[1]:
                del buffer[i]
        n += 1
    return buffer
x = [[9, 169], [5, 164], [340, 210], [1020, 102], [210, 312], [12, 150]]
processed = process(x)
print(processed)
The problem is that it doesn't recursively check whether there are any other duplicates, since it only checks adjacent coordinates. What is an efficient way of filtering the coordinates?
Sample Input with thresh = (10,10):
x = [[12,24], [5, 12],[100,1020], [20,30], [121,214], [15,12]]
Sample output:
x = [[12,24],[100,1020], [121,214]]

Your question is a bit vague, but I'm taking it to mean:
You want to compare all combinations of points
If a combination contains points closer than a threshold
Then remove the point further from the start of the input list
Try this:
import itertools
def process(input_list, threshold=(10,10)):
    combos = itertools.combinations(input_list, 2)
    points_to_remove = [point2
                        for point1, point2 in combos
                        if abs(point1[0]-point2[0])<=threshold[0] and abs(point1[1]-point2[1])<=threshold[1]]
    points_to_keep = [point for point in input_list if point not in points_to_remove]
    return points_to_keep
coords = [[12,24], [5, 12],[100,1020], [20,30], [121,214], [15,12]]
print(process(coords))
>>> [[12, 24], [5, 12], [100, 1020], [121, 214]]
The way this works is to generate all combinations of points using itertools (which leaves the points in the original order), and then create a list of points to remove using the threshold. Then it returns a list of points not in that list of points to remove.
You'll notice that I have one more point than you. I simply copied what seemed to be your intended functionality (i.e. both dy AND dx <= thresh for point removal). However, if I change the line with the AND condition so that a point is removed when dy OR dx <= thresh, I get the same output as your sample.
So I'm going to ask you to recheck your sample output.
BTW, it might be useful for you to confirm if checking for x and y proximity separately is what you really want. So as a bonus, I've included a version using the Euclidean distance as well:
import itertools
import math
def process(input_list, threshold=100):
    combos = itertools.combinations(input_list, 2)
    points_to_remove = [point2 for point1, point2 in combos if math.dist(point1, point2)<=threshold]
    points_to_keep = [point for point in input_list if point not in points_to_remove]
    return points_to_keep
coords = [[12,24], [5, 12],[100,1020], [20,30], [121,214], [15,12]]
print(process(coords))
>>> [[12, 24], [100, 1020], [121, 214]]
This version fits your original sample when I used a threshold radius of 100.

I'd split this up a bit differently. It's also tricky, of course, because of the way you have to modify the list.
def remove_close_neighbors(input_list, thresh, position):
    target_item = input_list[position]
    return [item for i, item in enumerate(input_list) if i == position or not is_close(target_item, item, thresh)]
This will remove all the "duplicate" (or close) points, other than the item under consideration.
(Then define is_close to check the threshold condition)
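For illustration, is_close could look something like the sketch below. This assumes the same per-axis (dx, dy) threshold tuple as in the question; the exact condition is my assumption, and a Euclidean distance check would slot in the same way.

```python
def is_close(a, b, thresh=(10, 10)):
    """Return True when b lies within thresh of a on BOTH axes (assumed rule)."""
    return abs(a[0] - b[0]) <= thresh[0] and abs(a[1] - b[1]) <= thresh[1]


print(is_close([12, 24], [20, 30]))  # dx=8, dy=6  -> True
print(is_close([12, 24], [5, 12]))   # dx=7, dy=12 -> False
```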
And then we can go over our items:
def process(input_list, thresh):
    pos = 0
    while pos < len(input_list):
        # drop everything close to the item at pos, keeping that item itself
        input_list = remove_close_neighbors(input_list, thresh, pos)
        pos += 1
    return input_list
This is by no means the most efficient way to achieve this. Depends on how scalable this needs to be for you. If we're talking "a bajillion points", you will need to look into clever data structures and algorithms. I think a tree structure could be good then, to group points "by sector", because then you don't have to compare each point to each other point all the time.
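To make the "group points by sector" idea a bit more concrete, here is a rough sketch using a uniform grid (a dict of cells) rather than a tree. The name process_grid and the cell layout are my own assumptions, not something from the question. Each point is only compared with already-kept points in its own cell and the 8 surrounding cells, so it avoids comparing every point with every other point. It assumes both thresholds are positive.

```python
from collections import defaultdict


def process_grid(points, thresh=(10, 10)):
    """Keep the first point of every close group, scanning in input order."""
    tx, ty = thresh  # assumed to be > 0
    grid = defaultdict(list)  # (cell_x, cell_y) -> kept points in that cell
    kept = []
    for x, y in points:
        cx, cy = x // tx, y // ty
        # Two points within (tx, ty) of each other are at most one cell apart,
        # so checking the 3x3 block of cells around (cx, cy) is sufficient.
        close = any(
            abs(x - px) <= tx and abs(y - py) <= ty
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            for px, py in grid.get((cx + dx, cy + dy), ())
        )
        if not close:
            grid[(cx, cy)].append((x, y))
            kept.append([x, y])
    return kept


coords = [[12, 24], [5, 12], [100, 1020], [20, 30], [121, 214], [15, 12]]
print(process_grid(coords))  # [[12, 24], [5, 12], [100, 1020], [121, 214]]
```

Like the loop above, this compares each point only against points that were kept, so results can differ from an approach that also compares against already-removed points.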

Related

How to group approximately adjacent list

I have a list whose values are approximately adjacent.
x=[10,11,13,70,71,73,170,171,172,174]
I need to separate this into sub-lists with minimal deviation, i.e.
y=[[10,11,13],[70,71,73],[170,171,172,174]]
You can see that y is grouped into 3 separate lists, with a break wherever there is a huge deviation.
Can you give me a tip or any source to solve this?
The zip function is your friend when you need to compare items of a list with their successor or predecessor:
x=[10,11,13,70,71,73,170,171,172,174]
threshold = 50
breaks = [i for i,(a,b) in enumerate(zip(x,x[1:]),1) if b-a>threshold]
groups = [x[s:e] for s,e in zip([0]+breaks,breaks+[None])]
print(groups)
[[10, 11, 13], [70, 71, 73], [170, 171, 172, 174]]
breaks will contain the index (i) of elements (b) that are greater than their predecessor (a) by more than the threshold value.
Using zip() again allows you to pair up these break indexes to form start/end ranges which you can apply to the original list to get your groupings.
Note that I used a fixed threshold to detect a "huge" deviation, but you can use a percentage or any formula/condition of your choice in place of if b-a>threshold. If the deviation calculation is complex, you will probably want to make a deviates() function and use it in the list comprehension: if deviates(a,b), so that it remains intelligible.
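As a small sketch of the percentage idea, here is the same comprehension with a relative deviates() check. The 25% figure and the rel parameter are arbitrary choices for illustration only.

```python
def deviates(a, b, rel=0.25):
    """True when b differs from a by more than rel (a fraction) of a."""
    return abs(b - a) > rel * abs(a)


x = [10, 11, 13, 70, 71, 73, 170, 171, 172, 174]
# index of every element that deviates from its predecessor
breaks = [i for i, (a, b) in enumerate(zip(x, x[1:]), 1) if deviates(a, b)]
# pair up the break indexes into start/end slices
groups = [x[s:e] for s, e in zip([0] + breaks, breaks + [None])]
print(groups)  # [[10, 11, 13], [70, 71, 73], [170, 171, 172, 174]]
```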
If zip() and list comprehensions are too advanced, you can do the same thing using a simple for-loop:
def deviates(a,b): # example of a (huge) deviation detection function
    return b-a > 50
groups = [] # resulting list of groups
previous = None # track previous number for comparison
for number in x:
    if not groups or deviates(previous, number):
        groups.append([number]) # 1st item or deviation, add new group
    else:
        groups[-1].append(number) # approximately adjacent, add to last group
    previous = number # remember previous value for next loop
Something like this should do the trick:
test_list = [10, 11, 13, 70, 71, 73, 170, 171, 172, 174]
def group_approximately_adjacent(numbers):
    if not numbers:
        return []
    current_number = numbers.pop(0)
    cluster = [current_number]
    clusters = [cluster]
    while numbers:
        next_number = numbers.pop(0)
        if is_approximately_adjacent(current_number, next_number):
            cluster.append(next_number)
        else:
            cluster = [next_number]
            clusters.append(cluster)
        current_number = next_number
    return clusters
def is_approximately_adjacent(a, b):
    deviation = 0.25
    return abs(a * (1 + deviation)) > abs(b) > abs(a * (1 - deviation))

Filter list of points by distance and score

I'm faced with a problem I'm trying to find an elegant solution to, but I don't have the in-depth knowledge of various libraries to achieve this yet.
I have a list of an object of the class 'Point' that contains an x and y coordinate and a score.
pts = [pt_1, pt_2, pt_3 ... pt_n]
Now, I need to find a way to filter this list so that, of any two points in the list that are too close to each other, we remove the one with the lower score.
So, given two points in list pts, pt_x and pt_y, if their Euclidean distance is smaller than threshold T, compare their scores and remove the point with the smaller score from the list. I have all the necessary classes and the comparison function implemented; I'm just not sure how to filter the list efficiently, and I wasn't able to do so successfully.
for idx, ptp in enumerate(pts[:]):
    for pto in pts[idx+1:]:
        if ptp.is_match(pto, 60):
            if ptp.score < pto.score:
                pts.remove(ptp)
                break
            else:
                pts.remove(pto)
return pts
The above was my attempt but didn't work.
To give a better idea, here's an input and expected output given Point(x, y, score) and distance threshold of 3 units:
Input:
pts = [
Point(11, 10, 0.97),
Point(10, 11, 0.96),
Point(50, 47, 0.87),
Point(10, 10, 0.98),
Point(78, 56, 0.99)
]
Output:
pts = [
Point(50, 47, 0.87),
Point(10, 10, 0.98),
Point(78, 56, 0.99)
]
Point class:
class Point:
    def __init__(self, x, y, score):
        self.x = x
        self.y = y
        self.score = score
    def is_match(self, point, thresh=50):
        return ((self.x - point.x) ** 2 + (self.y - point.y) ** 2) ** 0.5 <= thresh
EDIT:
Found a better solution that works for arbitrarily large samples, but is far from fast and elegant:
idx = 0
while idx < len(pts):
    pt = pts[idx]
    pt_s = list(filter(lambda ptt: ptt.is_match(pt, distance), pts[idx:]))
    if len(pt_s) > 1:
        max_pt = max(pt_s, key=operator.attrgetter('score'))
        pt_s.remove(max_pt)
        for pt in pt_s:
            pts.remove(pt)
    else:
        idx += 1
return pts
The problem with your code is that you're removing elements from a list while iterating over it. This is generally a bad idea. You can modify it slightly to store a set of elements scheduled for removal, like so:
newps = []
ignore = set()
while ps:
    p = ps.pop()
    if p not in ignore:
        for i,o in enumerate(ps):
            if o not in ignore and p.is_match(o,3):
                if p.score < o.score:
                    break
                else:
                    ignore.add(o)
        else:
            newps.append(p)
print(newps)
Note: the else clause of a for loop executes when the loop completes normally, that is, when there was no break.
While we're at it, why not go further and do away with lists altogether to make things cleaner?
ps = {
    Point(11, 10, 0.97),
    Point(10, 11, 0.96),
    Point(50, 47, 0.87),
    Point(10, 10, 0.98),
    Point(78, 56, 0.99)
}
newps = set()
while ps:
    p = ps.pop()
    remove = set()
    for o in ps:
        if p.is_match(o,3):
            if p.score < o.score:
                break
            else:
                remove.add(o)
    else:
        newps.add(p)
    ps.difference_update(remove)
print(newps)
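For comparison, here is a different, simpler technique (not the one used above): sort by score and greedily keep a point only if it is not within range of a point already kept. This is a sketch only; filter_points is a name I made up, and Point is re-declared as a dataclass just to keep the example self-contained. Note that a point is only suppressed by points that were actually kept, which may or may not be the behaviour you want.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    x: float
    y: float
    score: float

    def is_match(self, other, thresh=50):
        # Euclidean distance check, same rule as the question's Point class
        return ((self.x - other.x) ** 2 + (self.y - other.y) ** 2) ** 0.5 <= thresh


def filter_points(pts, thresh):
    """Greedy filter: visit points from highest to lowest score."""
    kept = []
    for p in sorted(pts, key=lambda q: q.score, reverse=True):
        if not any(p.is_match(k, thresh) for k in kept):
            kept.append(p)
    return kept


pts = [
    Point(11, 10, 0.97),
    Point(10, 11, 0.96),
    Point(50, 47, 0.87),
    Point(10, 10, 0.98),
    Point(78, 56, 0.99),
]
result = filter_points(pts, 3)
print(result)  # keeps (78, 56), (10, 10) and (50, 47), highest score first
```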

How to make sure that a list of generated numbers follow a uniform distribution

I have a list of 150 numbers from 0 to 149. I would like to use a for loop with 150 iterations in order to generate 150 lists of 6 numbers such that, in each iteration k, the number k is included as well as 5 different random numbers. For example:
S0 = [0, r1, r2, r3, r4, r5] # r1, r2,..., r5 are random numbers between 0 and 150
S1 = [1, r1', r2', r3', r4', r5'] # r1', r2',..., r5' are new random numbers between 0 and 150
...
S149 = [149, r1'', r2'', r3'', r4'', r5'']
In addition, the numbers in each list have to be different and with a minimum distance of 5. This is the code I am using:
import random
import numpy as np
final_list = []
for k in range(150):
    S = [k]
    for it in range(5):
        domain = [ele for ele in range(150) if ele not in S]
        d = 0
        x = k
        while d < 5:
            d = np.inf  # np.Infinity was removed in NumPy 2.0; np.inf works everywhere
            x = random.sample(domain, 1)[0]
            for ch in S:
                if np.abs(ch - x) < d:
                    d = np.abs(ch - x)
        S.append(x)
    final_list.append(S)
Output:
[[0, 149, 32, 52, 39, 126],
[1, 63, 16, 50, 141, 79],
[2, 62, 21, 42, 35, 71],
...
[147, 73, 38, 115, 82, 47],
[148, 5, 78, 115, 140, 43],
[149, 36, 3, 15, 99, 23]]
Now, the code works, but I would like to know whether it's possible to force the number of times each number appears across all iterations to be approximately the same. For example, after using the previous code, this plot indicates how many times each number has appeared in the generated lists:
As you can see, there are numbers that have appeared more than 10 times while there are others that have appeared only 2 times. Is it possible to reduce this level of variation so that this plot can be approximated as a uniform distribution? Thanks.
First, I am not sure that your assertion that the current results are not uniformly distributed is necessarily correct. It would seem prudent to me to try and examine the histogram over several repetitions of the process, rather than just one.
I am not a statistician, but when I want to approximate uniform distribution (and assuming that the functions in random provide uniform distribution), what I try to do is to simply accept all results returned by random functions. For that, I need to limit the choices given to these functions ahead of calling them. This is how I would go about your task:
import random
import numpy as np
N = 150
def random_subset(n):
    result = []
    cands = set(range(N))
    for i in range(6):
        result.append(n) # Initially, n is the number that must appear in the result
        cands -= set(range(n - 4, n + 5)) # Remove candidates less than 5 away
        n = random.choice(list(cands)) # Select next number
    return result
result = np.array([random_subset(n) for n in range(N)])
print(result)
Simply put, whenever I add a number n to the result set, I take out of the selection candidates, an environment of the proper size, to ensure no number of a distance of less than 5 can be selected in the future.
The code is not optimized (multiple set to list conversions) but it works (as per my understanding).
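To follow up on the suggestion of looking at the histogram over several repetitions rather than one, here is a rough sketch that tallies how often each number appears across many runs. The helper appearance_counts and the repetitions parameter are my own additions, and random_subset is repeated (slightly parameterised) only so the snippet runs on its own.

```python
import random
from collections import Counter

N = 150


def random_subset(n, size=6, min_dist=5):
    """Start from n, then keep drawing numbers at least min_dist from all chosen."""
    result = []
    cands = set(range(N))
    for _ in range(size):
        result.append(n)
        cands -= set(range(n - min_dist + 1, n + min_dist))
        n = random.choice(sorted(cands))
    return result


def appearance_counts(repetitions=200):
    """Count appearances of every number over several full runs of 150 lists."""
    counts = Counter()
    for _ in range(repetitions):
        for k in range(N):
            counts.update(random_subset(k))
    return counts


counts = appearance_counts(20)
print(min(counts.values()), max(counts.values()))
```

Plotting counts (or just comparing its minimum and maximum) for a growing number of repetitions should give a better feel for how uneven the distribution really is than a single run does.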
You can force it to be precisely uniform, if you so desire.
Apologies for the mix of globals and locals, this seemed the most readable. You would want to rewrite according to how variable your constants are =)
import random
SIZE = 150
SAMPLES = 5
def get_samples():
    pool = list(range(SIZE)) * SAMPLES
    random.shuffle(pool)
    items = []
    for i in range(SIZE):
        selection, pool = pool[:SAMPLES], pool[SAMPLES:]
        item = [i] + selection
        items.append(item)
    return items
Then you will have exactly 5 of each (and one more in the leading position, which is a weird data structure).
>>> set(collections.Counter(vv for v in get_samples() for vv in v).values())
{6}
The method above does not guarantee the last 5 numbers are unique; in fact, you would expect ~10/150 to have a duplicate. If that is important, you need to filter your distribution a little more and decide how much you value tight uniformity, avoiding duplicates, etc.
If your numbers are approximately what you gave above, you can also patch up the results (fairly) and hope to avoid long search times (this does not hold when SAMPLES is close to SIZE):
def get_samples():
    pool = list(range(SIZE)) * SAMPLES
    random.shuffle(pool)
    i = 0
    while i < len(pool):
        if i % SAMPLES == 0:
            seen = set()
        v = pool[i]
        if v in seen: # swap
            dst = random.choice(range(SIZE))
            pool[dst], pool[i] = pool[i], pool[dst]
            # Restart from the earlier of the two segments touched by the swap,
            # so that neither of them is left unchecked
            i = min(i, dst)
            i -= i % SAMPLES
        else:
            seen.add(v)
            i += 1
    items = []
    for i in range(SIZE):
        selection, pool = pool[:SAMPLES], pool[SAMPLES:]
        assert len(set(selection)) == SAMPLES, selection
        item = [i] + selection
        items.append(item)
    return items
This will typically take less than 5 passes through to clean up any duplicates, and should leave all arrangements satisfying your conditions equally likely.

Value in an array between two numbers in python

So making a title that actually explains what I want is harder than I thought, so here goes me explaining it.
I have an array filled with zeros that gets values added every time a condition is met, so after one time-step iteration I get something like this (minus the headers):
current_array =
bubble_size y_coord
14040 42
3943 71
6345 11
0 0
0 0
....
After this time step is complete, current_array becomes previous_array and is then reset to zeros, because the number of entries is not guaranteed to be the same each time.
Now the real question: I want to check every row in the first column of previous_array to see whether the current bubble size is within, say, 5% either side of it. If so, I want to subtract the y value stored in previous_array's second column for the matching bubble size from the current y position.
Currently I have something like:
if bubble_size in current_array[:, 0]:
    do_whatever
but I don't know how to pull out the associated y_coord without using a loop. I am fine with using one, but would like to avoid it (the array has about 100 rows and there are at least 1000 time steps, so I want to make it as efficient as possible).
I have included my thoughts on the for loop (note that current_array and previous_array are actually called current_frame and previous_frame):
for y in range(0, array_size):
    # bubble size within 5% either side of the stored size in column 0
    if previous_frame[y, 0] * .95 < bubble_size < previous_frame[y, 0] * 1.05:
        # column 1 holds the stored y coordinate
        distance_travelled = current_y_coord - previous_frame[y, 1]
Any help is greatly appreciated :)
I probably did not fully understand your issue, but if you first want to check whether the bubble size in each row is within 5% of the same row's previous value, you can use the following:
import numpy as np
def apply(p, c): # For each element check the bubblesize grow
    if(p*0.95 < c < p*1.05):
        return 1
    else:
        return 0
def dist(p, c): # Calculate the distance
    return c-p
def update(prev, cur):
    assert isinstance(
        cur, np.ndarray), 'Current array is not a valid numpy array'
    assert isinstance(
        prev, np.ndarray), 'Previous array is not a valid numpy array'
    assert prev.shape == cur.shape, 'Arrays size mismatch'
    applyvec = np.vectorize(apply)
    toapply = applyvec(prev[:, 0], cur[:, 0])
    print(toapply)
    distvec = np.vectorize(dist)
    distance = distvec(prev[:, 1], cur[:, 1])
    print(distance)
current = np.array([[14040, 42],
                    [3943,71],
                    [6345,11],
                    [0,0],
                    [0,0]])
previous = np.array([[14039, 32],
                     [3942,61],
                     [6344,1],
                     [0,0],
                     [0,0]])
update(previous,current)
PS: Please, could you tell us what is the final array you look for based on my examples?
As I understand it (correct me if I'm wrong):
You have a current bubble size (integer) and a current y value (integer)
You have a 2D array (prev_array) that contains bubble sizes and y coords
You want to check whether your current bubble size is within 5% (either way) of each stored bubble size in prev_array
If they are within range, subtract your current y value from the stored y coord
This will result in a new array, containing only bubble sizes that are within range, and the newly subtracted y value
You want to do this without an explicit loop
You can do that using boolean indexing in numpy...
Setup the previous array:
prev_array = np.array([[14040, 42], [3943, 71], [6345, 11], [3945,0], [0,0]])
prev_array
array([[14040, 42],
[ 3943, 71],
[ 6345, 11],
[ 3945, 0],
[ 0, 0]])
You have your stored bubble size you want to use for comparison, and a current y coord value:
bubble_size = 3750
cur_y = 10
Next we can create a boolean mask that selects only the rows of prev_array that meet the 5% criterion:
ind = (bubble_size > prev_array[:,0]*.95) & (bubble_size < prev_array[:,0]*1.05)
# ind is a boolean array that looks like this: [False, True, False, True, False]
Then we use ind to index prev_array, and calculate the new (subtracted) y coords:
new_array = prev_array[ind]
new_array[:,1] = cur_y - new_array[:,1]
Giving your final output array:
array([[3943, -61],
[3945, 10]])
As it's not clear what you want your output to actually look like, instead of creating a new array, you can also just update prev_array with the new y values:
ind = (bubble_size > prev_array[:,0]*.95) & (bubble_size < prev_array[:,0]*1.05)
prev_array[ind,1] = cur_y - prev_array[ind,1]
Which gives:
array([[14040, 42],
[ 3943, -61],
[ 6345, 11],
[ 3945, 10],
[ 0, 0]])
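As an alternative sketch of the same mask, np.isclose with a relative tolerance expresses the "within 5%" test in a single call. One difference to be aware of: np.isclose uses an inclusive comparison (|a - b| <= rtol * |b|), whereas the mask above uses strict inequalities. The variable distances is a name I introduced for the example.

```python
import numpy as np

prev_array = np.array([[14040, 42], [3943, 71], [6345, 11], [3945, 0], [0, 0]])
bubble_size = 3750
cur_y = 10

# True where bubble_size is within 5% of the stored size; atol=0 makes the
# test purely relative, so the all-zero padding rows never match.
ind = np.isclose(bubble_size, prev_array[:, 0], rtol=0.05, atol=0)
distances = cur_y - prev_array[ind, 1]

print(ind)        # [False  True False  True False]
print(distances)  # [-61  10]
```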

What is this Python code trying to do?

The following python code is to traverse a 2D grid of (c, g) in some special order, which is stored in "jobs" and "job_queue". But I am not sure which kind of order it is after trying to understand the code. Is someone able to tell about the order and give some explanation for the purpose of each function? Thanks and regards!
import Queue
c_begin, c_end, c_step = -5, 15, 2
g_begin, g_end, g_step = 3, -15, -2
def range_f(begin,end,step):
    # like range, but works on non-integer too
    seq = []
    while True:
        if step > 0 and begin > end: break
        if step < 0 and begin < end: break
        seq.append(begin)
        begin = begin + step
    return seq
def permute_sequence(seq):
    n = len(seq)
    if n <= 1: return seq
    mid = int(n/2)
    left = permute_sequence(seq[:mid])
    right = permute_sequence(seq[mid+1:])
    ret = [seq[mid]]
    while left or right:
        if left: ret.append(left.pop(0))
        if right: ret.append(right.pop(0))
    return ret
def calculate_jobs():
    c_seq = permute_sequence(range_f(c_begin,c_end,c_step))
    g_seq = permute_sequence(range_f(g_begin,g_end,g_step))
    nr_c = float(len(c_seq))
    nr_g = float(len(g_seq))
    i = 0
    j = 0
    jobs = []
    while i < nr_c or j < nr_g:
        if i/nr_c < j/nr_g:
            # increase C resolution
            line = []
            for k in range(0,j):
                line.append((c_seq[i],g_seq[k]))
            i = i + 1
            jobs.append(line)
        else:
            # increase g resolution
            line = []
            for k in range(0,i):
                line.append((c_seq[k],g_seq[j]))
            j = j + 1
            jobs.append(line)
    return jobs
def main():
    jobs = calculate_jobs()
    job_queue = Queue.Queue(0)
    for line in jobs:
        for (c,g) in line:
            job_queue.put((c,g))
main()
EDIT:
There is a value for each (c,g). The code actually is to search in the 2D grid of (c,g) to find a grid point where the value is the smallest. I guess the code is using some kind of heuristic search algorithm? The original code is here http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/gridsvr/gridregression.py, which is a script to search for svm algorithm the best values for two parameters c and g with minimum validation error.
permute_sequence reorders a list of values so that the middle value is first, then the midpoint of each half, then the midpoints of the four remaining quarters, and so on. So permute_sequence(range(1000)) starts out like this:
[500, 250, 750, 125, 625, 375, ...]
calculate_jobs alternately fills in rows and columns using the sequences of 1D coordinates provided by permute_sequence.
If you're going to search the entire 2D space eventually anyway, this does not help you finish sooner. You might as well just scan all the points in order. But I think the idea was to find a decent approximation of the minimum as early as possible in the search. I suspect you could do about as well by shuffling the list randomly.
xkcd readers will note that the urinal protocol would give only slightly different (and probably better) results:
[0, 1000, 500, 250, 750, 125, 625, 375, ...]
Here is an example of permute_sequence in action:
print permute_sequence(range(8))
# prints [4, 2, 6, 1, 5, 3, 7, 0]
print permute_sequence(range(12))
# prints [6, 3, 9, 1, 8, 5, 11, 0, 7, 4, 10, 2]
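The snippets above are Python 2 (print statement, range returning a list). If you want to try this on Python 3, a port along the following lines should behave the same; the list(seq) copy is my addition so that range objects can be passed in directly.

```python
def permute_sequence(seq):
    """Midpoint first, then the midpoints of each half, interleaved, and so on."""
    seq = list(seq)  # accept range objects and avoid mutating the caller's list
    n = len(seq)
    if n <= 1:
        return seq
    mid = n // 2
    left = permute_sequence(seq[:mid])
    right = permute_sequence(seq[mid + 1:])
    ret = [seq[mid]]
    # alternate between the reordered left and right halves
    while left or right:
        if left:
            ret.append(left.pop(0))
        if right:
            ret.append(right.pop(0))
    return ret


print(permute_sequence(range(8)))   # [4, 2, 6, 1, 5, 3, 7, 0]
print(permute_sequence(range(12)))  # [6, 3, 9, 1, 8, 5, 11, 0, 7, 4, 10, 2]
```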
I'm not sure why it uses this order, because in main, it appears that all candidate pairs of (c,g) are still evaluated, I think.
