I have a list of lists where each element represents the average height, as an integer, of one square metre of the map (one number = one square metre). For example:
map=[
[1,1,1,1],
[1,1,2,2],
[1,2,2,2]
] # where 1 and 2 are the average heights at those coordinates.
I'm trying to implement a method that, given a position, finds the area around it that has the same height; let's call these 'flat areas'.
I found a solution in the flood-fill algorithm. However, I'm having some problems when it comes to writing the code. I get a
RuntimeError: maximum recursion depth exceeded
I have no idea where my problem is. Here is the code of the function:
def zona_igual_alcada(self, pos, zones=[], h=None):
    x, y = pos
    if h == None:
        h = base_terreny.base_terreny.__getitem__(self, (x, y))
    if base_terreny.base_terreny.__getitem__(self, (x, y)) != h:
        return
    if x in range(0, self.files) and y in range(0, self.columnes):
        if base_terreny.base_terreny.__getitem__(self, (x, y)) == h:
            zones.append((x, y))
            terreny.zona_igual_alcada(self, (x-1, y), zones, h)
            terreny.zona_igual_alcada(self, (x+1, y), zones, h)
            terreny.zona_igual_alcada(self, (x, y-1), zones, h)
            terreny.zona_igual_alcada(self, (x, y+1), zones, h)
    return set(zones)
You're not doing anything to "mark" the zones you have already visited, so you keep revisiting the same zones over and over until the stack fills up.
This isn't a particularly efficient way to do a flood fill, so if you have a large number of zones you will be better off looking for a more efficient algorithm (e.g., scanline fill).
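As a minimal illustration of the "mark visited cells" fix (hypothetical names and plain list-of-lists indexing rather than the original class methods), an iterative version with an explicit stack also avoids the recursion limit entirely:

def flat_area(grid, start):
    rows, cols = len(grid), len(grid[0])
    h = grid[start[0]][start[1]]      # the height we are flood-filling
    visited = set()
    stack = [start]
    while stack:
        x, y = stack.pop()
        if (x, y) in visited:
            continue                  # already handled this cell
        if not (0 <= x < rows and 0 <= y < cols):
            continue                  # outside the map
        if grid[x][y] != h:
            continue                  # different height, not part of the flat area
        visited.add((x, y))
        stack.extend([(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)])
    return visited

For the example map above, flat_area(map, (0, 0)) returns the coordinates of all connected cells with height 1.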
So I have two different files containing multiple trajectories in a square map (512x512 pixels). Each file contains information about the spatial position of each particle within a track/trajectory (X and Y coordinates) and which track/trajectory each spot belongs to (TRACK_ID).
My goal was to find a way to cluster similar trajectories between both files. I found a nice way to do this (distance clustering comparison), but the code is too slow. I was just wondering if someone has some suggestions to make it faster.
My files look something like this:
The approach that I implemented finds similar trajectories based on something called the Fréchet distance (maybe not too relevant here). Below you can find the function that I wrote, but briefly this is the rationale:
group all the spots by track using pandas.groupby function for file1 (growth_xml) and file2 (shrinkage_xml)
for each trajectory in growth_xml (loop) I compare it with each trajectory in shrinkage_xml
if they pass the Fréchet distance criterion that I defined (an if statement), I save both tracks in a new table. You can see an additional filter condition that I called delay, but I guess that it's not important to explain here.
so really simple:
import numpy as np
import pandas as pd
# frechetDist() is my own Fréchet-distance function (not shown here).

def distance_clustering(growth_xml, shrinkage_xml):
    coords_g = pd.DataFrame()  # empty dataframes to save filtered tracks
    coords_s = pd.DataFrame()
    counter = 0  # initialize counter to count the number of filtered tracks
    for track_g, param_g in growth_xml.groupby('TRACK_ID'):
        # define growing track as a multi-point line object
        traj1 = [(x, y) for x, y in zip(param_g.POSITION_X.values, param_g.POSITION_Y.values)]
        for track_s, param_s in shrinkage_xml.groupby('TRACK_ID'):
            # define shrinking track as a second multi-point line object
            traj2 = [(x, y) for x, y in zip(param_s.POSITION_X.values, param_s.POSITION_Y.values)]
            # compute delay between shrinkage and growing ends to use as an extra filter
            delay = param_s.FRAME.iloc[0] - param_g.FRAME.iloc[0]
            # keep the pair only if the Fréchet distance is lower than 0.2 microns
            if frechetDist(traj1, traj2) < 0.2 and delay > 0:
                counter += 1
                param_g = param_g.assign(NEW_ID=np.ones(param_g.shape[0]) * counter)
                coords_g = pd.concat([coords_g, param_g])
                param_s = param_s.assign(NEW_ID=np.ones(param_s.shape[0]) * counter)
                coords_s = pd.concat([coords_s, param_s])
    coords_g.reset_index(drop=True, inplace=True)
    coords_s.reset_index(drop=True, inplace=True)
    return coords_g, coords_s
The main problem is that most of the time I have more than two thousand tracks (!!) and this pairwise comparison takes forever. I'm wondering if there's a simple and more efficient way to do this. Perhaps by doing the pairwise comparison in multiple small areas instead of over the whole map? Not sure...
Have you tried building a (DeltaX, DeltaY) lookup-table matrix for the pairwise distances? It will take a while to compute the LUT once, or you can write it to a file and load it when the algorithm starts.
Then you only have to look up the right cell to get the result instead of computing it each time.
You could also fit a polynomial regression for the distance calculation; it will be less precise but definitely faster.
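A rough sketch of that lookup-table idea, assuming integer pixel coordinates on the 512x512 map (the names below are illustrative, not from the original code): the table is built once, and the point-to-point distance used inside the Fréchet computation becomes a single array lookup.

import numpy as np

SIZE = 512
dy, dx = np.mgrid[0:SIZE, 0:SIZE]
dist_lut = np.sqrt(dx ** 2 + dy ** 2)   # built once (~2 MB as float64)

def point_distance(p, q):
    # p and q are (x, y) tuples with integer pixel coordinates
    return dist_lut[abs(p[1] - q[1]), abs(p[0] - q[0])]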
Maybe not an outright answer, but it's been a while. Could you not segment the lines and use a minimum bounding box around each segment to assess similarities? I might be thinking of your problem the wrong way around; I'm not sure. Right now I'm trying to work with polygons from two different data sets and want to optimize the processing by first identifying the polygons in both geometries that overlap.
In your case, I think segments would leave you with some edge artifacts. Maybe look at this paper: https://drops.dagstuhl.de/opus/volltexte/2021/14879/pdf/OASIcs-ATMOS-2021-10.pdf or this paper (with Python code): https://www.austriaca.at/0xc1aa5576_0x003aba2b.pdf
I'm retrieving the arrays holding the power levels and frequencies, respectively, of a signal from the plt.psd() method:
Pxx, freqs = plt.psd(signals[0], NFFT=2048, Fs=sdr.sample_rate/1e6, Fc=sdr.center_freq/1e6, scale_by_freq=True, color="green")
Please ignore the green and red signals. Just the blue one is relevant for this question.
I'm able to have the peakutils.peak.indexes() method return the X and Y coordinates of a number of the most significant peaks (of the blue signal):
power_lvls = 10*log10(Pxx/(sdr.sample_rate/1e6))
indexes = peakutils.peak.indexes(np.array(power_lvls), thres=0.6/max(power_lvls), min_dist=120)
print("\nX: {}\n\nY: {}\n".format(freqs[indexes], np.array(power_lvls)[indexes]))
As can be seen, the coordinates fit the blue peaks quite nicely.
What I'm not satisfied with is the number of peak coordinates I receive from the peak.indexes() method. I'd like to have only the coordinates of all peaks above a certain power level returned, e.g., -25 (which would then be exactly 5 peaks for the blue signal). According to the documentation of the peak.indexes() method this is done by providing the desired value as thres parameter.
But no matter what I try as thres, the method seems to entirely ignore my value and instead solely rely on the min_dist parameter to determine the number of returned peaks.
What is wrong with my threshold value (which I believe currently means "peaks above the lower 60% of the plot"), and how do I correctly specify an absolute power level (instead of a percentage)?
[EDIT]
I figured out that the thres parameter apparently only accepts float values between 0. and 1.
So, by changing my line slightly as follows I can now influence the number of returned peaks as desired:
indexes = peakutils.peak.indexes(np.array(power_lvls), thres=0.4, min_dist=1)
But that still leaves me with the question whether it's possible to somehow limit the result to the five highest peaks (provided num_of_peaks above thres >= 5).
I believe something like the following would return the five highest values:
print(power_lvls[np.argsort(power_lvls[indexes])[-5:]])
Unfortunately, though, negative values seem to be interpreted as the highest values in my power_lvls array. Can this line be changed such that (+)10 would be considered higher than, e.g., -40? Or is there another (better?) solution?
[EDIT 2]
These are the values I get as the six "highest" peaks:
power_lvls = 10*log10(Pxx/(sdr.sample_rate/1e6))+10*log10(8/3)
indexes = peakutils.indexes(power_lvls, thres=0.35, min_dist=1)
power_lvls_max = power_lvls[np.argsort(power_lvls[indexes])[-6:]]
print("Highest Peaks in Signal:\nX: \n\nY: {}\n".format(power_lvls_max))
After trying various things for hours without any improvement, I'm starting to think that these are neither valleys nor peaks, just some "random" values, which leads me to believe that there is a problem with my argsort line that I have to figure out first.
[EDIT 3]
The bottleneck.partition() method seems to return the correct values (even if apparently it does so in random order, not from leftmost peak to rightmost peak):
import bottleneck as bn
power_lvls_max = -bn.partition(-power_lvls[indexes], 6)[:6]
Luckily, the order of the peaks is not important for what I have planned to do with the coordinates. I do, however, still have to figure out how to match the Y values I have now to their corresponding X values ...
Also, while I do have a solution now, for learning purposes it would still be interesting to know what was wrong with my argsort attempt.
A simple way to solve this would be to add a constant (for example, +50 dB) to your Pxx vector before the processing. That way you avoid the negative-valued peaks. After the processing is done, you can subtract the constant again to get the right peak values.
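For illustration only, a minimal sketch of that offset trick, applied here in the dB domain (to power_lvls from the question's code) rather than to Pxx itself, with an arbitrary +50 dB constant:

offset = 50.0
shifted = power_lvls + offset                       # all peak values are now positive
indexes = peakutils.indexes(shifted, thres=0.35, min_dist=1)
top5 = np.sort(shifted[indexes])[-5:] - offset      # subtract the constant again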
I figured out how to find the corresponding X values and get the full coordinates of the six highest peaks:
power_lvls = 10*log10(Pxx/(sdr.sample_rate/1e6))+10*log10(8/3)
indexes = peakutils.indexes(power_lvls, thres=0.35, min_dist=1)
print("Peaks in Signal 1\nX: {}\n\nY: {}\n".format(freqs[indexes], power_lvls[indexes]))
power_lvls_max = -bn.partition(-power_lvls[indexes], 6)[:6]
check = np.isin(power_lvls, power_lvls_max)
indexes_max = np.where(check)
print("Highest Peaks in Signal 1:\nX: {}\n\nY: {}\n".format(freqs[indexes_max], power_lvls[indexes_max]))
Now I have my "peak filtering" (kind of), which I originally tried to achieve by messing around with the thres value of peakutils.peak.indexes(). The code above gives me just the desired result:
Is it possible to speed up the great_circle(pos1, pos2).miles from geopy if using it for multiple thousand points?
I want to create something like a distance matrix and at the moment my machine needs 5 seconds for 250,000 calculations.
Actually, pos1 is always the same, if that helps.
Another "restriction" in my case is that I only want all points pos2 which have a distance of less than a constant x.
(The exact distance doesn't matter in my case.)
Is there a fast method? Do I need to use a faster but less accurate function than great_circle, or is it possible to speed it up without losing accuracy?
Update
In my case the question is really whether a point lies inside a circle.
So it is easy to first check whether the point lies inside the circle's bounding square.
import geopy.distance  # also makes geopy.Point available

start = geopy.Point(mid_point_lat, mid_point_lon)
d = geopy.distance.VincentyDistance(miles=radius)
p_north_lat = d.destination(point=start, bearing=0).latitude
# check whether the given point lat is > p_north_lat
# and so on for east, south and west
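If the exact distance really doesn't matter, one option (a sketch under the assumption that all pos2 coordinates are available as NumPy arrays; this is not geopy's API) is a vectorized haversine test against the fixed pos1, which avoids calling great_circle in a Python loop:

import numpy as np

EARTH_RADIUS_MILES = 3958.8

def within_radius(lat1, lon1, lats2, lons2, radius_miles):
    # lats2 / lons2 are NumPy arrays holding the coordinates of all pos2 points (degrees)
    lat1, lon1 = np.radians(lat1), np.radians(lon1)
    lats2, lons2 = np.radians(lats2), np.radians(lons2)
    dlat = lats2 - lat1
    dlon = lons2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lats2) * np.sin(dlon / 2) ** 2
    dist = 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))
    return dist < radius_miles  # boolean mask: True where pos2 is inside the circle

The bounding-square check above can still be applied first to discard most points cheaply before the haversine test.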
I am looking for the most efficient way to randomly draw n elements from a list, given a list of probabilities stating the probability of each element being picked.
aList = [3,4,2,1,4,3,5,7,6,4]
MyProba = [0.1,0.1,0.2,0,0.1,0,0.2,0,0.2,0.1]
It means that at each draw, the first element (which is 3) has a probability of 0.1 of being drawn. Of course,
sum(MyProba) == 1 # always returns True
len(aList) == len(MyProba) # always returns True
Up to now I did the following:
import random

def random_pick(some_list, proba):
    x = random.uniform(0, 1)
    cumulative_proba = 0.0
    for item, item_proba in zip(some_list, proba):
        cumulative_proba += item_proba
        if x < cumulative_proba:
            break
    return item

nb_draws = 10
list_of_drawn_elements = []
for one_draw in range(nb_draws):
    list_of_drawn_elements.append(random_pick(aList, MyProba))
It works but it is terribly slow for long lists and big values of nb_draws. How can I improve the speed of this process?
Note: In the special case I am facing, nb_draws always equals the length of aList.
The general idea (as outlined by others' answers as well) is that your method is inefficient because the preprocessing (the calculation of the cumulative distribution) is done every time you draw a sample, although it would be enough to do it once before the sampling and then use the preprocessed data to do the sampling.
The preprocessing and sampling can be done efficiently with Walker's alias method. I implemented it a while ago; take a look at the source code. (Sorry for the external link, but I think it's too long to post here.) My version requires NumPy; if you don't want to use NumPy, there is a NumPy-free alternative as well (on which my version is based).
Edit: the explanation of Walker's alias method is to be found in the first link I provided. In a nutshell, imagine that you somehow managed to construct a rectangular "darts board" that is subdivided into parts such that each part corresponds to one of your original items, and the area of each part is proportional to the desired probability of selecting the corresponding element. You can then start throwing darts at random at the darts board (by generating two random numbers that specify the horizontal and vertical coordinate of where the dart ended up) and check which areas the darts hit. The items corresponding to the areas will be the items you have selected. Walker's alias method is simply a linear-time preprocessing that constructs the dart board. Drawing each element can then be done in constant time. In the end, drawing m elements out of n will have a cost of O(n) for preprocessing and O(m) for generating the samples, yielding a total complexity of O(n + m).
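For reference, a compact, NumPy-free sketch of the alias-table construction described above (illustrative code, not the linked implementation):

import random

def build_alias_table(probs):
    # O(n) preprocessing: split the probabilities into n equal-width columns,
    # each holding at most two of the original items.
    n = len(probs)
    scaled = [p * n for p in probs]
    prob, alias = [0.0] * n, [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias):
    # O(1) per draw: pick a column, then one of its (at most) two items.
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

Building the table with prob, alias = build_alias_table(MyProba) is done once; each call to aList[alias_draw(prob, alias)] then returns one sampled element in constant time.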
Here's my lazy method: build a list with the expected number of occurrences of each value for the desired distribution, and use random.choice() to pick a value from that list.
>>> import random
>>>
>>> value_probs = list(zip([3,4,2,1,4,3,5,7,6,4], [0.1,0.1,0.2,0,0.1,0,0.2,0,0.2,0.1]))
>>> expected_dist = sum([[i] * int(prob * 100) for i, prob in value_probs], [])
>>> random.choice(expected_dist)
You might try to precalculate the cumulative probability range for each element and build a tree from these intervals. Then you get logarithmic complexity for looking up the element corresponding to the generated probability, instead of the linear complexity you have now.
You're recalculating cumulative_proba every time you call random_pick. I suggest calculating it once outside the method and using a better data structure to store it, like a binary search tree, which reduces each lookup from O(n) to O(log n).
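A short sketch of that suggestion using a sorted array and binary search instead of an explicit tree (itertools.accumulate and bisect are standard library; the names here are illustrative):

import bisect
import random
from itertools import accumulate

# Precompute the cumulative probabilities once (O(n))...
cumulative = list(accumulate(MyProba))

def random_pick_fast(some_list, cumulative):
    # ...then each draw is a single binary search (O(log n)).
    x = random.uniform(0, cumulative[-1])
    return some_list[bisect.bisect_left(cumulative, x)]

list_of_drawn_elements = [random_pick_fast(aList, cumulative) for _ in range(len(aList))]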
I'm trying to solve a problem related to graphs in Python. Since it's a competitive programming problem, I'm not using any third-party packages.
The problem presents a graph in the form of a 5 X 5 square grid.
A bot is assumed to be at a user-supplied position on the grid. The grid is indexed at (0,0) on the top left and (4,4) on the bottom right. Each cell in the grid is represented by one of the following 3 characters: ‘b’ (ASCII value 98) indicates the bot’s current position, ‘d’ (ASCII value 100) indicates a dirty cell, and ‘-‘ (ASCII value 45) indicates a clean cell in the grid.
For example below is a sample grid where the bot is at 0 0:
b---d
-d--d
--dd-
--d--
----d
The goal is to clean all the cells in the grid in the minimum number of steps.
A step is defined as a task, where either
i) The bot changes its position
ii) The bot changes the state of the cell (from d to -)
Assume that initially the position marked as b need not be cleaned. The bot is allowed to move UP, DOWN, LEFT and RIGHT.
My approach
I've read a couple of tutorials on graphs and decided to model the graph as a 25 x 25 adjacency matrix, with 0 representing no path and 1 representing a path (since we can only move in 4 directions). Next, I decided to apply the Floyd–Warshall all-pairs shortest-path algorithm to it and then sum up the values of the paths.
But I have a feeling that it won't work.
I'm in a dilemma: the problem is either one of the following:
i) A minimum spanning tree problem (which I'm unable to do, as I'm not able to model and store the grid as a graph).
ii) An A* search (again a wild guess, but the same problem here: I'm not able to model the grid as a graph properly).
I'd be thankful if you could suggest a good approach to problems like these. Also, some hints and pseudocode about various forms of graph-based problems (or links to those) would be helpful. Thanks.
I think you're asking two questions here.
1. How do I represent this problem as a graph in Python?
As the robot moves around, he'll be moving from one dirty square to another, sometimes passing through some clean spaces along the way. Your job is to figure out the order in which to visit the dirty squares.
# Code is untested and may contain typos. :-)

# A list of the (x, y) coordinates of all of the dirty squares.
dirty_squares = [(0, 4), (1, 1), etc.]
n = len(dirty_squares)

# Everywhere after here, refer to dirty squares by their index
# into dirty_squares.

def compute_distance(i, j):
    return (abs(dirty_squares[i][0] - dirty_squares[j][0])
            + abs(dirty_squares[i][1] - dirty_squares[j][1]))

# distances[i][j] is the cost to move from dirty square i to
# dirty square j.
distances = []
for i in range(n):
    distances.append([compute_distance(i, j) for j in range(n)])

# The x, y coordinates of where the robot starts.
start_node = (0, 0)

# first_move_distances[i] is the cost to move from the robot's
# start location to dirty square i.
first_move_distances = [
    abs(start_node[0] - dirty_squares[i][0])
    + abs(start_node[1] - dirty_squares[i][1])
    for i in range(n)]

# order is a list of the dirty squares.
def cost(order):
    if not order:
        return 0  # Cleaning 0 dirty squares is free.
    return (first_move_distances[order[0]]
            + sum(distances[order[i]][order[i+1]]
                  for i in range(len(order)-1)))
Your goal is to find a way to reorder list(range(n)) that minimizes the cost.
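As a hedged illustration of what "reorder list(range(n))" means, a brute-force version (only feasible for a handful of dirty squares, since it tries all n! orders) would be:

from itertools import permutations

# Try every visiting order and keep the cheapest one, using cost() from above.
best_order = min(permutations(range(n)), key=cost)
print(best_order, cost(best_order))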
2. How do I find the minimum number of moves to solve this problem?
As others have pointed out, the generalized form of this problem is intractable (NP-Hard). You have two pieces of information that help constrain the problem to make it tractable:
The graph is a grid.
There are at most 24 dirty squares.
I like your instinct to use A* here. It's often good for solving find-the-minimum-number-of-moves problems. However, A* requires a fair amount of code. I think you'd be better off going with a branch-and-bound approach (sometimes called branch-and-prune), which should be almost as efficient but is much easier to implement.
The idea is to start enumerating all possible solutions using a depth-first-search, like so:
# Each list represents a sequence of dirty nodes.
[]
[1]
[1, 2]
[1, 2, 3]
[1, 3]
[1, 3, 2]
[2]
[2, 1]
[2, 1, 3]
Every time you're about to recurse into a branch, check to see if that branch is more expensive than the cheapest solution found so far. If so, you can skip the whole branch.
If that's not efficient enough, add a function to calculate a lower bound on the remaining cost. Then if cost([2]) + lower_bound(set([1, 3])) is more expensive than the cheapest solution found so far, you can skip the whole branch. The tighter lower_bound() is, the more branches you can skip.
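A minimal sketch of that branch-and-bound search, reusing n, distances and first_move_distances from the snippet above (the trivial lower_bound here is a placeholder assumption; a tighter estimate prunes more):

best = {'cost': float('inf'), 'order': None}

def lower_bound(remaining):
    return 0  # placeholder; e.g. a nearest-neighbour estimate would be tighter

def search(order, cost_so_far, remaining):
    if not remaining:
        if cost_so_far < best['cost']:
            best['cost'], best['order'] = cost_so_far, list(order)
        return
    for i in list(remaining):
        step = first_move_distances[i] if not order else distances[order[-1]][i]
        new_cost = cost_so_far + step
        # prune: skip the whole branch if it cannot beat the best solution so far
        if new_cost + lower_bound(remaining - {i}) >= best['cost']:
            continue
        search(order + [i], new_cost, remaining - {i})

search([], 0, set(range(n)))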
Let's say V = {v | v = b or v = d}, and build the complete graph G(V, E). You can calculate the cost of each edge in E with a time complexity of O(n^2). The problem then becomes exactly: start at a specified vertex and find a shortest path in G that covers all of V.
This has been known as the Traveling Salesman Problem (TSP) since 1832.
The problem can certainly be stored as a graph. The cost between nodes (dirty cells) is their Manhattan distance. Ignore the cost of cleaning cells, because that total cost will be the same no matter what path is taken.
This problem looks to me like the minimum rectilinear Steiner tree problem. Unfortunately, that problem is NP-hard, so you'll need to come up with an approximation (a minimum spanning tree based on Manhattan distance), if I am correct.
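If anyone wants to try that approximation, here is a rough sketch (a hypothetical helper, not from the answer above) of Prim's algorithm over the dirty cells using Manhattan distance as the edge weight; the O(n^3) nested minimum is fine for at most 25 cells:

def manhattan_mst_weight(cells):
    # cells: list of (x, y) positions, e.g. the bot's start plus all dirty cells
    remaining = set(cells)
    in_tree = {remaining.pop()}
    total = 0
    while remaining:
        # pick the cheapest edge connecting the tree to a cell outside it
        w, nxt = min((abs(a[0] - b[0]) + abs(a[1] - b[1]), b)
                     for a in in_tree for b in remaining)
        total += w
        in_tree.add(nxt)
        remaining.remove(nxt)
    return total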