I am currently working on a recursive function to search through a dictionary that represents locations along a river. The dictionary indexes four parallel arrays, using start as the key.
Parallel Arrays:
start = location of the endpoint with the smaller flow accumulation,
end = location of the other endpoint (with the larger flow accumulation),
length = segment length, and
shape = actual shape, oriented to run from start to end.
Dictionary:
G = {}
for (st, sh, le, en) in zip(start, shape, length, end):
    G[st] = (sh, le, en)
My goal is to search down the river from one of its start points (represented by p) and select locations at 2000-metre intervals (the offset represented by x) until the end. This is the recursive function I'm working on in Python:
def Downstream(p, x, G):
    e = G[p]
    if IsNull(e):
        return "Not Found"
    if x < 0:
        return "Invalid"
    if x < e.length:
        return Along(e.shape, x)
    return Downstream(e.end, x - e.length, G)
Currently, when I enter Downstream("(1478475.0, 12065385.0)", 2000, G) it returns a KeyError. I have checked `key in G`, which returns False, but when I inspect G.keys() it lists all the keys from start, including the ones for which `in` returns False.
For example, one key is (1478475.0, 12065385.0). I've tried this key both as text and as a tuple of two double values, and a KeyError was raised both times.
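A minimal sketch of the usual cause of this kind of KeyError: a string that *prints* like a tuple is still a different object from the tuple itself (toy dictionary below; the shape/length/end values are placeholders, not the real data):

```python
# Toy dictionary keyed by a tuple of floats, like the G in the question.
G = {(1478475.0, 12065385.0): ("shape", (7250.8,), (1.0, 2.0))}

string_key = "(1478475.0, 12065385.0)"   # a field value read as text
tuple_key = (1478475.0, 12065385.0)      # what was actually stored as the key

print(string_key in G)  # False - a str never equals a tuple
print(tuple_key in G)   # True  - an identical tuple of floats matches
```

If the keys were built from cursor rows as tuples, the lookup argument must also be a tuple of the same float values; passing the key as quoted text will always raise a KeyError.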
Error:
Runtime error
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 1, in Downstream
KeyError: (1478475.0, 12065385.0)
What is causing the KeyError, and how can I solve this issue to reach my goal?
I am using Python in ArcGIS, as this works with an attribute table from a shapefile of polylines, and this is my first attempt at writing a recursive function.
This question and its answer are how I reached this point in organizing my data and writing this recursive function:
https://gis.stackexchange.com/questions/87649/select-points-approx-2000-metres-from-another-point-along-a-river
Examples:
>>> G.keys ()
[(1497315.0, 11965605.0), (1502535.0, 11967915.0), (1501785.0, 11968665.0)...
>>> print G
{(1497315.0, 11965605.0): (([1499342.3515172896, 11967472.92330054],), (7250.80302528,), (1501785.0, 11968665.0)), (1502535.0, 11967915.0): (([1502093.6057616705, 11968248.26139775],), (1218.82250994,), (1501785.0, 11968665.0)),...
Your function is not working for five main reasons:
The syntax is off - indentation is significant in Python;
You don't check whether p in G each time Downstream is called (the first key may be present, but what about later recursive calls?);
You have too many return points - for example, your last line will never run (what should the function's output be?);
You are accessing a plain 3-tuple e = (sh, le, en) by attribute (e.end), which tuples don't support; and
You subtract from the interval length in each recursive call, so the function can't keep track of how far apart the points should be - you need to separate the interval from the offset into the current segment.
Instead, I think you need something like (untested!):
def Downstream(start, interval, data, offset=0, out=None):
    if out is None:
        out = []
    if start in data:
        shape, length, end = data[start]
        length = length[0]
        if interval < length:
            distance = offset
            while distance < length:
                out.append(Along(shape, distance))
                distance += interval
            distance -= interval
            Downstream(end, interval, data, interval - (length - distance), out)
        else:
            Downstream(end, interval, data, offset - length, out)
    return out
This will give you a list of whatever Along returns. If the original start is not in data, it will return an empty list.
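For instance, a quick self-contained check of the function above, with a stub Along and two made-up segments (the names, lengths, and 1-tuple packing mirror the G dump in the question, but the values are hypothetical):

```python
def Along(shape, distance):
    # Stub for the real Along(), which would return the point at
    # `distance` metres along the polyline `shape`.
    return (shape, distance)

def Downstream(start, interval, data, offset=0, out=None):
    if out is None:
        out = []
    if start in data:
        shape, length, end = data[start]
        length = length[0]
        if interval < length:
            distance = offset
            while distance < length:
                out.append(Along(shape, distance))
                distance += interval
            distance -= interval
            Downstream(end, interval, data, interval - (length - distance), out)
        else:
            Downstream(end, interval, data, offset - length, out)
    return out

# Two chained segments: "A" (3000 m) flows into "B" (5000 m).
data = {
    "A": ("shape_A", (3000.0,), "B"),
    "B": ("shape_B", (5000.0,), "C"),  # "C" is not a key, so recursion stops
}
print(Downstream("A", 2000, data))
# [('shape_A', 0), ('shape_A', 2000), ('shape_B', 1000.0), ('shape_B', 3000.0)]
```

The four points land every 2000 m along the combined 8000 m of river: at 0 m, 2000 m, 4000 m (1000 m into segment B), and 6000 m (3000 m into segment B).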
I have a dataset with x coordinates, y coordinates, and a function value. I have a function that checks the input coordinates and does something depending on whether the value is found. But for a large numpy array it takes too long - about a second to check through two such arrays (x and y).
The reason it is a 'long time' is that this same check itself happens a few thousand times. Basically, I am trying to match the function value from one large grid, with its own spacing in x and y, to a new large grid with different spacing, using interpolation where the x, y values do not match.
Is there a faster way to find the proper index? The problem is that the new grid values cannot be assigned for all grid points at once, as they could be from a text file; the coordinates get passed one at a time through a function (a rule of the open-source library). Moreover, read_excel() needs to be called several times too, since this function can only take the new grid coordinates as its arguments - otherwise there are constructor errors in the code that calls this function.
I tried this common way:
def assign_to_new_grid(p):
    old_grid_values_pd = pd.read_excel('data.xlsx')
    old_grid_values = np.array(old_grid_values_pd)
    if p.x in old_grid_values[:, 0] and p.y in old_grid_values[:, 1]:
        # Assign simply; 0th column has x values, 1st column has y values
    else:
        # Interpolate
This gets called for all of the thousands of points in the new grid. There's no way to use a list of the new coordinates productively, since that would only eliminate the need for the else branch, not the function as a whole.
If one of your performance problems is reading the file countless times, like this:
read_excel_count = 0

def pd_read_excel(filename):  # fake for demo
    global read_excel_count
    read_excel_count += 1
    print(f"call pd_read_excel, {filename=!r}, {read_excel_count=}")

def assign_to_new_grid(p):
    old_grid_values_pd = pd_read_excel('data.xlsx')
    # ...

for i in range(100):
    assign_to_new_grid(...)
call pd_read_excel, filename='data.xlsx', read_excel_count=1
call pd_read_excel, filename='data.xlsx', read_excel_count=2
call pd_read_excel, filename='data.xlsx', read_excel_count=3
...
call pd_read_excel, filename='data.xlsx', read_excel_count=99
call pd_read_excel, filename='data.xlsx', read_excel_count=100
Then you can read it once and reuse it afterwards, by creating a class to store the data and giving a method instead of a function to your framework, like so:
read_excel_count = 0  # reset

class Something:
    def __init__(self, filename):
        self.data = pd_read_excel(filename)

    def assign_to_new_grid(self, p):
        old_grid_values_pd = self.data
        # ...

filename = 'data.xlsx'
assign_to_new_grid = Something(filename).assign_to_new_grid  # taking a reference on the instance's method

for i in range(100):
    assign_to_new_grid(...)
Which results in only one call:
call pd_read_excel, filename='data.xlsx', read_excel_count=1
And there is a stdlib tool for exactly that: functools.lru_cache!
from functools import lru_cache

@lru_cache(maxsize=10)  # maxsize is optional, but be careful if you have large data files
def pd_read_excel(filename):  # fake for demo
    global read_excel_count
    read_excel_count += 1
    print(f"call pd_read_excel, {filename=!r}, {read_excel_count=}")

def assign_to_new_grid(p):
    old_grid_values_pd = pd_read_excel('data.xlsx')
    # ...

for i in range(100):
    assign_to_new_grid(...)
This also gets the pd_read_excel function called only once - the first time, when the cache is not yet filled.
And it will work if you have several files too.
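As a runnable illustration of that caching behaviour (using a stand-in load function and a call counter instead of a real spreadsheet):

```python
from functools import lru_cache

calls = {"n": 0}  # counts real executions of the wrapped function

@lru_cache(maxsize=10)
def load(filename):  # stand-in for a real pd.read_excel wrapper
    calls["n"] += 1
    return f"data from {filename}"

for _ in range(100):
    load("data.xlsx")    # only the first call runs the body; 99 hit the cache
load("other.xlsx")       # a different argument is a different cache key

print(calls["n"])  # 2
```

Note that lru_cache keys the cache on the arguments, so each distinct filename is read exactly once.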
This is a simple "trick" that may give you a big speed boost - I know it sometimes did for me. But without a proper way to understand/reproduce your problem, I fear it will not be practical for us to search for a solution to your problem.
I am trying to implement the Hamming distance in Python. Hamming distance is typically used to measure the distance between two codewords. The operation is simply performing exclusive OR. For example, if we have the codewords 10011101 and 10111110, then their exclusive OR would be 00100011, and the Hamming distance is said to be 1 + 1 + 1 = 3.
My code is as follows:
def hamming_distance(codeword1, codeword2):
    """Calculate the Hamming distance between two bit strings"""
    assert len(codeword1) == len(codeword2)
    x, y = int(codeword1, 2), int(codeword2, 2)  # '2' specifies that we are reading a binary number
    count, z = 0, x ^ y
    while z:
        count += 1
        z &= z - 1
    return count
def checking_codewords(codewords, received_data):
    closestDistance = len(received_data)  # set default/placeholder closest distance as the maximum possible distance
    closestCodeword = received_data  # default/placeholder closest codeword
    for i in codewords:
        if hamming_distance(i, received_data) < closestDistance:
            closestCodeword = i
            closestDistance = hamming_distance(i, received_data)
    return closestCodeword

print(checking_codewords(['1010111101', '0101110101', '1110101110', '0000000110', '1100101001'], '0001000101'))
hamming_distance(codeword1, codeword2) takes the two input parameters codeword1 and codeword2 in the form of binary values and returns the Hamming distance between the two input codewords.
checking_codewords(codewords, received_data) should determine the correct codeword IFF there are any errors in the received data (i.e., the output is the corrected codeword string) - although, as you can see, I haven't added the "IFF there are any errors in received data" part yet.
I just tested the checking_codewords function with a set of examples, and it seems to have worked correctly for all of them except one. When I use the set of codewords ['1010111101', '0101110101', '1110101110', '0000000110', '1100101001'] and the received data '0001000101' the output is 0101110101, which is apparently incorrect. Is there something wrong with my code, or is 0101110101 actually correct and there is something wrong with the example? Or was this just a case where there was no error in the received data, so my code missed it?
From my point of view, it is not clear why your algorithm transforms the initial strings into integers to do a bitwise difference.
I mean, after asserting equal length you can simply compute the distance using the zip function:
sum(c1 != c2 for c1, c2 in zip(codeword1, codeword2))
For the sum function, Python treats True as 1 and False as 0.
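For example, with the two codewords from the question's own illustration, the booleans sum to the expected distance of 3:

```python
# Each mismatched position contributes True (== 1) to the sum.
print(sum(c1 != c2 for c1, c2 in zip('10011101', '10111110')))  # 3
```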
Doing a little simplification of your code (note that the tuple paired with each distance must be the candidate codeword i, not received_data, or min will always hand back the received word):
def hamming_distance(codeword1, codeword2):
    """Calculate the Hamming distance between two bit strings"""
    assert len(codeword1) == len(codeword2)
    return sum(c1 != c2 for c1, c2 in zip(codeword1, codeword2))

def checking_codewords(codewords, received_data):
    min_dist, min_word = min((hamming_distance(i, received_data), i) for i in codewords)
    return min_word

print(checking_codewords(['1010111101', '0101110101', '1110101110', '0000000110', '1100101001'], '0001000101'))
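As a side check on the example data (not part of the original answer): computing all five distances shows the received word is actually tied between two codewords at distance 3, so '0101110101' is a valid nearest codeword - the example is ambiguous rather than the code being wrong:

```python
def hamming_distance(codeword1, codeword2):
    assert len(codeword1) == len(codeword2)
    return sum(c1 != c2 for c1, c2 in zip(codeword1, codeword2))

codewords = ['1010111101', '0101110101', '1110101110', '0000000110', '1100101001']
received = '0001000101'
print([hamming_distance(c, received) for c in codewords])
# [6, 3, 8, 3, 6] - both '0101110101' and '0000000110' are at distance 3
```

Which of the two tied codewords gets returned depends on the tie-breaking rule (the question's loop keeps the first one seen; min over (distance, word) tuples breaks ties lexicographically).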
I have a question about the following code, but I guess it applies to other functions as well.
This function computes the maximum path and its length for a DAG, given the graph, source node, and end node.
To keep track of already computed distances across recursions I use the "max_distances_and_paths" variable, and update it on each recursion.
Is it better to keep it as a function parameter (passed in and returned across recursions), or to use a global variable initialized outside the function?
How can I avoid having this parameter returned when calling the function externally (i.e. it has to be returned across recursions, but I don't care about its value externally)? Is there a better way than doing LongestPath(G, source, end)[0:2]?
Thanks
# for a DAG computes maximum distance and maximum path nodes sequence (ordered in reverse).
# Recursively computes the paths and distances to edges which are adjacent to the end node
# and selects the maximum one
# It will return a single maximum path (and its distance) even if there are different paths
# with same max distance
# Input {Node 1: adj nodes directed to Node 1 ... Node N: adj nodes directed to Node N}
# Example: {'g': ['r'], 'k': ['g', 'r']})
def LongestPath(G, source, end, max_distances_and_paths=None):
    if max_distances_and_paths is None:
        max_distances_and_paths = {}
    max_path = [end]
    distances_list = []
    paths_list = []
    # return max_distance and max_path from source to current "end" if already computed (i.e.
    # present in the dictionary tracking maximum distances and corresponding paths)
    if end in max_distances_and_paths:
        return max_distances_and_paths[end][0], max_distances_and_paths[end][1], max_distances_and_paths
    # base case, when end node equals source node
    if source == end:
        max_distance = 0
        return max_distance, max_path, max_distances_and_paths
    # if there are no adjacent nodes directed to the end node (and it is not the source node,
    # previous case), the path is disconnected
    if len(G[end]) == 0:
        return 0, [0], {"": []}
    # for each adjacent node pointing to the end node, recursively compute its max distance to
    # the source node and add one to get the distance to the end node. Recursively add nodes
    # included in the path
    for t in G[end]:
        sub_distance, sub_path, max_distances_and_paths = LongestPath(G, source, t, max_distances_and_paths)
        paths_list += [[end] + sub_path]
        distances_list += [1 + sub_distance]
    # compute max distance
    max_distance = max(distances_list)
    # access the same index where max_distance is, in the list of paths, to retrieve the path
    # corresponding to the max distance
    index = [i for i, x in enumerate(distances_list) if x == max_distance][0]
    max_path = paths_list[index]
    # update the dictionary tracking maximum distances and corresponding paths from the source
    # node to the current end node
    max_distances_and_paths.update({end: [max_distance, max_path]})
    # return computed max distance, corresponding path, and tracker
    return max_distance, max_path, max_distances_and_paths
Global variables are generally avoided for several reasons (see "Why are global variables evil?"). I would recommend passing the parameter in this case. However, you could also define a larger function housing your recursive function. Here's a quick example I wrote for a factorial:
def a(m):
    def b(m):
        if m < 1:
            return 1
        return m * b(m - 1)
    n = b(m)
    m = m + 2
    return n, m

print(a(6))
This will give (720, 8). This shows that even if you use the same variable name in your recursive function, the one you passed into the larger function will not change. In your case, you would just return n, as per my example. I only returned an edited m value to show that even though both the a and b functions take m as input, Python keeps them separate.
In general I would say avoid the use of global variables. They make your code harder to read and often more difficult to debug once your codebase gets a bit more complex, so avoiding them is good practice.
I would use a helper function to initialise your recursion:
def longest_path_helper(G, source, end, max_distances_and_paths=None):
    max_distance, max_path, max_distances_and_paths = LongestPath(
        G, source, end, max_distances_and_paths
    )
    return max_distance, max_path
On a side note, in Python the convention is to write function names in lowercase with underscores, while CapitalizedWords without underscores are used for classes. So it would be more Pythonic to use def longest_path():
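The same helper pattern can be sketched end to end with a simplified stand-in for LongestPath (path counting instead of longest path, so the example stays short): the underscore-prefixed function threads the tracker through the recursion, and the public wrapper discards it so callers never see it.

```python
def _count_paths(G, source, end, memo=None):
    # Internal recursion: passes the memo dict along and returns it.
    if memo is None:
        memo = {}
    if end == source:
        return 1, memo
    if end in memo:
        return memo[end], memo
    total = 0
    for t in G[end]:
        n, memo = _count_paths(G, source, t, memo)
        total += n
    memo[end] = total
    return total, memo

def count_paths(G, source, end):
    # Public entry point: callers never see the memo.
    n, _ = _count_paths(G, source, end)
    return n

# Same adjacency convention as the question: node -> nodes directed to it.
G = {'r': [], 'g': ['r'], 'k': ['g', 'r']}
print(count_paths(G, 'r', 'k'))  # 2 paths: r->k and r->g->k
```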
Background: I have two catalogues consisting of the positions of spatial objects. My aim is to find the similar objects in both catalogues, within a maximum angular distance of a certain value. One catalogue is called bss and the other is called super.
Here is the full code I wrote
import numpy as np
def crossmatch(bss_cat, super_cat, max_dist):
    matches = []
    no_matches = []

    def find_closest(bss_cat, super_cat):
        dist_list = []

        def angular_dist(ra1, dec1, ra2, dec2):
            r1 = np.radians(ra1)
            d1 = np.radians(dec1)
            r2 = np.radians(ra2)
            d2 = np.radians(dec2)
            a = np.sin(np.abs(d1 - d2)/2)**2
            b = np.cos(d1)*np.cos(d2)*np.sin(np.abs(r1 - r2)/2)**2
            rad = 2*np.arcsin(np.sqrt(a + b))
            d = np.degrees(rad)
            return d

        for i in range(len(bss_cat)):  # The problem arises here
            for j in range(len(super_cat)):
                distance = angular_dist(bss_cat[i][1], bss_cat[i][2], super_cat[j][1], super_cat[j][2])  # While this is supposed to produce single floating point values, it produces a numpy.ndarray of three entries
                dist_list.append(distance)  # This list now contains numpy.ndarrays instead of numpy.float values
        for k in range(len(dist_list)):
            if dist_list[k] < max_dist:
                element = (bss_cat[i], super_cat[j], dist_list[k])
                matches.append(element)
            else:
                element = bss_cat[i]
                no_matches.append(element)
        return (matches, no_matches)
When run separately, the function angular_dist(ra1, dec1, ra2, dec2) produces a single numpy.float value, as expected. But when used inside the for loop in this crossmatch(bss_cat, super_cat, max_dist) function, it produces numpy.ndarrays instead of numpy.float values, as noted in the comments in the code. I don't know where the code goes wrong. Please help.
I have a project which I decided to do in Python. In brief: I have a list of lists. Each of those also contains lists, sometimes with one element, sometimes more. It looks like this:
rules = [
    [[1], [2], [3, 4, 5], [4], [5], [7]],
    [[1], [8], [3, 7, 8], [3], [45], [12]],
    [[31], [12], [43, 24, 57], [47], [2], [43]],
]
The point is to compare values from a numpy array to values from these rules (elements of the rules table). We compare some [x][y] point to the first element (e.g. 1 in the first rule); then, if that matches, the value [x-1][j] from the array with the second element, and so on. All five comparisons must be true to change the value of the [x][y] point. I've written something like this (the main function is SimulateLoop; the order is switched because the simulate2 function was written after it):
def simulate2(self, i, j, w, rule):
    data = Data(rule)
    if w.world[i][j] in data.c:
        if w.world[i-1][j] in data.n:
            if w.world[i][j+1] in data.e:
                if w.world[i+1][j] in data.s:
                    if w.world[i][j-1] in data.w:
                        w.world[i][j] = data.cc[0]
                    else: return
                else: return
            else: return
        else: return
    else: return

def SimulateLoop(self, w):
    for z in range(w.steps):
        for i in range(2, w.x-1):
            for j in range(2, w.y-1):
                for rule in w.rules:
                    self.simulate2(i, j, w, rule)
Data class:
class Data:
    def __init__(self, rule):
        self.c = rule[0]
        self.n = rule[1]
        self.e = rule[2]
        self.s = rule[3]
        self.w = rule[4]
        self.cc = rule[5]
The NumPy array is an object from the World class. Rules is the list described above, parsed by a function obtained from another program (GPL licence), so the rules themselves should be good.
To be honest it seems like it should work, but it doesn't. I have tried other possibilities, without luck. It runs - the interpreter doesn't return any errors - but somehow the values in the array change incorrectly.
Maybe it will be helpful: it is Perrier's loop, a modified Langton's loop (artificial life).
I will be very thankful for any help!
I am not familiar with Perrier's loop, but if you are coding something like the famous Game of Life, you may have made a simple mistake: storing the next generation in the same array, thus corrupting it.
Normally you store the next generation in a temporary array and do a copy/swap after the sweep, as in this sketch:
def do_step_in_game_life(world):
    next_gen = zeros(world.shape)  # <<< Tmp array here
    Nx, Ny = world.shape
    for i in range(1, Nx-1):
        for j in range(1, Ny-1):
            neighbours = sum(world[i-1:i+2, j-1:j+2]) - world[i, j]
            if neighbours < 3:
                next_gen[i, j] = 0
            elif ...
    world[:, :] = next_gen[:, :]  # <<< Saving computed next generation
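To make the sketch concrete, here is a complete, runnable version using the standard Game of Life rules (the question's rule tables would replace the if/elif body, but the temp-array/copy structure is the point):

```python
import numpy as np

def do_step_in_game_life(world):
    """One synchronous update: read from `world`, write into a temp array."""
    next_gen = np.zeros(world.shape, dtype=world.dtype)
    Nx, Ny = world.shape
    for i in range(1, Nx - 1):
        for j in range(1, Ny - 1):
            neighbours = np.sum(world[i-1:i+2, j-1:j+2]) - world[i, j]
            if world[i, j] == 1 and neighbours in (2, 3):
                next_gen[i, j] = 1  # survival
            elif world[i, j] == 0 and neighbours == 3:
                next_gen[i, j] = 1  # birth
    world[:, :] = next_gen  # swap in the new generation only after the full sweep

# A "blinker" oscillates between a horizontal and a vertical bar of three:
w = np.zeros((5, 5), dtype=int)
w[2, 1:4] = 1                      # horizontal bar
do_step_in_game_life(w)
print(w[1:4, 2].tolist())          # [1, 1, 1] - now a vertical bar
```

If the update wrote into `world` directly, later cells in the same sweep would see a mixture of old and new generations, which is exactly the kind of "values changing wrong" the question describes.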