Is there something wrong with my Python Hamming distance code? - python

I am trying to implement the Hamming distance in Python. Hamming distance is typically used to measure the distance between two codewords. The operation is simply performing exclusive OR. For example, if we have the codewords 10011101 and 10111110, then their exclusive OR would be 00100011, and the Hamming distance is said to be 1 + 1 + 1 = 3.
My code is as follows:
def hamming_distance(codeword1, codeword2):
    """Calculate the Hamming distance between two bit strings"""
    assert len(codeword1) == len(codeword2)
    x, y = int(codeword1, 2), int(codeword2, 2) # '2' specifies that we are reading a binary number
    count, z = 0, x^y
    while z:
        count += 1
        z &= z - 1
    return count
def checking_codewords(codewords, received_data):
    closestDistance = len(received_data) # set default/placeholder closest distance as the maximum possible distance.
    closestCodeword = received_data # default/placeholder closest codeword
    for i in codewords:
        if(hamming_distance(i, received_data) < closestDistance):
            closestCodeword = i
            closestDistance = hamming_distance(i, received_data)
    return closestCodeword
print(checking_codewords(['1010111101', '0101110101', '1110101110', '0000000110', '1100101001'], '0001000101'))
hamming_distance(codeword1, codeword2) takes the two input parameters codeword1 and codeword2 in the form of binary values and returns the Hamming distance between the two input codewords.
checking_codewords(codewords, received_data) should determine the correct codeword IFF there are any errors in received data (i.e., the output is the corrected codeword string). Although, as you can see, I haven't added the "IFF there are any errors in received data" part yet.
I just tested the checking_codewords function with a set of examples, and it seems to have worked correctly for all of them except one. When I use the set of codewords ['1010111101', '0101110101', '1110101110', '0000000110', '1100101001'] and the received data '0001000101' the output is 0101110101, which is apparently incorrect. Is there something wrong with my code, or is 0101110101 actually correct and there is something wrong with the example? Or was this just a case where there was no error in the received data, so my code missed it?

From my point of view, it is not clear why your algorithm converts the input strings to integers in order to compute a bitwise difference.
After asserting that the lengths are equal, you can simply count the differing positions using the zip function:
sum([c1!=c2 for c1,c2 in zip(codeword1,codeword2)])
This works because, for the sum function, Python treats True as 1 and False as 0.
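A minimal sketch of that counting trick, applied to the two codewords from the question. The function name `hamming` and the `ValueError` check are only illustrative choices, not part of the original code:

```python
def hamming(a, b):
    """Count the positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("codewords must have the same length")
    # Each comparison yields a bool; sum() adds True as 1 and False as 0.
    return sum(c1 != c2 for c1, c2 in zip(a, b))

print(hamming("10011101", "10111110"))  # 3, matching the three 1s in 00100011
```

Because it compares characters directly, this version works for any equal-length strings, not only binary ones.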
Simplifying your code a little:
def hamming_distance(codeword1, codeword2):
    """Calculate the Hamming distance between two bit strings"""
    assert len(codeword1) == len(codeword2)
    return sum([c1!=c2 for c1,c2 in zip(codeword1,codeword2)])
def checking_codewords(codewords, received_data):
    # Pair each distance with its codeword i (not with received_data), so min() returns the closest codeword
    min_dist, min_word = min([(hamming_distance(i, received_data), i) for i in codewords])
    return min_word
print(checking_codewords(['1010111101', '0101110101', '1110101110', '0000000110', '1100101001'], '0001000101'))
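To check the example from the question by hand, the sketch below lists the distance from the received word to every codeword. The names `distances` and `tied` are illustrative. Two codewords, 0101110101 and 0000000110, both sit at distance 3, so which one is returned depends only on how ties are broken: a strict `<` comparison keeps the first one found, while `min` over `(distance, codeword)` tuples picks the lexicographically smaller string.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    assert len(a) == len(b)
    return sum(c1 != c2 for c1, c2 in zip(a, b))

codewords = ['1010111101', '0101110101', '1110101110', '0000000110', '1100101001']
received = '0001000101'

# Map every codeword to its distance from the received word.
distances = {word: hamming_distance(word, received) for word in codewords}
for word, dist in distances.items():
    print(word, dist)

# Collect every codeword that shares the smallest distance.
best = min(distances.values())
tied = [word for word, dist in distances.items() if dist == best]
print(tied)  # ['0101110101', '0000000110']
```

With a tie like this, nearest-codeword decoding has no unique answer, so as far as the distance is concerned neither result is wrong.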

Related

How to solve a Linear System of Equations in Python When the Coefficients are Unknown (but still real numbers)

I'm not a programmer, so go easy on me please! I have a system of 4 linear equations in 4 unknowns, which I think I could use Python to solve relatively easily. However, my equations are not of the form "5x+2y+z-w=0"; instead I have algebraic constants c_i whose explicit numerical values I don't know. For example, "c_1 x + c_2 y + c_3 z + c_4 w = c_5" would be one of my four equations. So does a solver exist which gives answers for x, y, z, w in terms of the c_i?
Numpy has a function for this exact problem: numpy.linalg.solve
To construct the matrix we first need to digest the string turning it into an array of coefficients and solutions.
Finding Numbers
First we need to write a function that takes a string like "c_1 3" and returns the number 3.0. Depending on the format you want in your input string you can either iterate over all chars in this array and stop when you find a non-digit character, or you can simply split on the space and parse the second string. Here are both solutions:
def find_number(sub_expr):
    """
    Finds the number from the format
    number*string or numberstring.
    Example:
    3x -> 3
    4*x -> 4
    """
    num_str = str()
    for char in sub_expr:
        if char.isdigit():
            num_str += char
        else:
            break
    return float(num_str)
or the simpler solution
def find_number(sub_expr):
    """
    Returns the number from the format "string number"
    """
    return float(sub_expr.split()[1])
Note: See edits
Get matrices
Now we can use that to split each expression at the "=" into two parts: the equation and the solution. The equation is then split into sub-expressions at each "+". This way we would turn the string "3x+4y = 3" into
sub_expressions = ["3x", "4y"]
solution_string = "3"
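A minimal sketch of that splitting step on its own. The `.strip()` calls are an addition here to drop the spaces around "=", and the variable names `left` and `right` are illustrative:

```python
expression = "3x+4y = 3"

# Split once at "=": the left side is the equation, the right side the solution.
left, right = expression.split("=")

# Split the equation into its terms at each "+".
sub_expressions = [term.strip() for term in left.split("+")]
solution_string = right.strip()

print(sub_expressions)   # ['3x', '4y']
print(solution_string)   # 3
```

Note that splitting on "+" alone means a negative coefficient has to be written as part of a term (for example "z*-3"), as in the test input further down.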
Each sub-expression then needs to be fed into our find_number function. The end result can be appended to the coefficient and solution matrices:
def get_matrices(expressions):
    """
    Returns coefficient_matrix and solutions from array of string-expressions.
    """
    coefficient_matrix = list()
    solutions = list()
    last_len = -1
    for expression in expressions:
        # Note: In this solution all coefficients must be explicitly noted and must always be in the same order.
        # Could be solved with dicts but is probably overengineered.
        if "=" not in expression:
            print(f"Invalid expression {expression}. Missing \"=\"")
            return False
        try:
            c_string, s_string = expression.split("=")
            c_strings = c_string.split("+")
            solutions.append(float(s_string))
            current_len = len(c_strings)
            if last_len != -1 and current_len != last_len:
                print(f"The expression {expression} has a mismatching number of coefficients")
                return False
            last_len = current_len
            coefficients = list()
            for c_string in c_strings:
                coefficients.append(find_number(c_string))
            coefficient_matrix.append(coefficients)
        except Exception as e:
            # Report the expression being parsed (the original referenced an undefined name here)
            print(f"An unexpected Runtime Error occurred at {expression}")
            print(e)
            exit()
    return coefficient_matrix, solutions
Now let's write a simple main function to test this code:
# This is not the code you want to copy-paste
# Look further down.
from sys import argv as args
def main():
    expressions = args[1:]
    matrix, solutions = get_matrices(expressions)
    for row in matrix:
        print(row)
    print("")
    print(solutions)
if __name__ == "__main__":
    main()
Let's run the program in the console!
user:$ python3 solve.py 2x+3y=4 3x+3y=2
[2.0, 3.0]
[3.0, 3.0]
[4.0, 2.0]
You can see that the program identified all our numbers correctly
AGAIN: use the find_number function appropriate for your format
Put The Pieces Together
These Matrices now just need to be pumped directly into the numpy function:
# This is the main you want
from sys import argv as args
from numpy.linalg import solve as solve_linalg
def main():
    expressions = args[1:]
    matrix, solutions = get_matrices(expressions)
    coefficients = solve_linalg(matrix, solutions)
    print(coefficients)
# This bit needs to be at the very bottom of your code to load all functions first.
# You could just paste the main-code here, but this is considered best-practice
if __name__ == '__main__':
    main()
Now let's test that:
$ python3 solve.py x*2+y*4+z*0=20 x*1+y*1+z*-1=3 x*2+y*2+z*-3=3
[2. 4. 3.]
As you can see, the program now solves the system of equations for us.
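The printed result can be sanity-checked by substituting it back into the equations. The sketch below does that in plain Python; the `matrix` and `solutions` lists are written out by hand to match what the command line above should parse to, and the helper name `residuals` is illustrative:

```python
# Coefficients and right-hand sides for:
#   x*2+y*4+z*0=20   x*1+y*1+z*-1=3   x*2+y*2+z*-3=3
matrix = [[2.0, 4.0, 0.0],
          [1.0, 1.0, -1.0],
          [2.0, 2.0, -3.0]]
solutions = [20.0, 3.0, 3.0]
result = [2.0, 4.0, 3.0]  # the values printed by the program

def residuals(matrix, solutions, result):
    """Left-hand side minus right-hand side for every equation."""
    return [sum(c * v for c, v in zip(row, result)) - rhs
            for row, rhs in zip(matrix, solutions)]

print(residuals(matrix, solutions, result))  # [0.0, 0.0, 0.0]
```

A residual of zero in every row means x=2, y=4, z=3 satisfies all three equations.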
Out of curiosity: Math homework? This feels like math homework.
Edit: I had a typo ("c_string" instead of "c_strings") that worked in all tests out of pure and utter luck.
Edit 2: Upon further inspection I would recommend splitting the sub-expressions at a "*":
def find_number(sub_expr):
    """
    Returns the number from the format "string*number"
    """
    return float(sub_expr.split("*")[1])
This results in fairly readable input strings.

Find Euclidean distance between landmarks of faces

I've got multiple frames and I've detected the faces in each frame using Retinaface. I would like to keep track of the faces using their landmarks.
To find the similarity between 2 sets of landmarks, I tried to calculate the Euclidean distance:
Input :
landmark_1 = [1828, 911], [1887, 913], [1841, 942], [1832, 974], [1876, 976]
landmark_2 = [1827, 928], [1887, 926], [1848, 963], [1836, 992], [1884, 990]
After consulting other links, I wrote the function below, but the values it produces are very high:
def euclidean_dist(vector_x, vector_y):
    vector_x, vector_y = np.array(vector_x), np.array(vector_y)
    if len(vector_x) != len(vector_y):
        raise Exception('Vectors must be same dimensions')
    ans = sum((vector_x[dim] - vector_y[dim]) ** 2 for dim in range(len(vector_x)))
    return np.sqrt(np.sum(ans**2))
Output :
euclidean_dist(landmark_1, landmark_2)
>> 1424.9424549784458
(Expecting some smaller value in this case)
I guess the code can only be used for a one-dimensional vector, but I'm really stuck here. Any help would be really appreciated.
It looks like you're squaring the answer twice (ans**2). But you can also simplify the function somewhat:
def euclidean_dist(vector_x, vector_y):
    vector_x, vector_y = np.array(vector_x), np.array(vector_y)
    return np.sqrt(np.sum((vector_x - vector_y)**2, axis=-1))
This will automatically raise an exception when the vectors are incompatible shapes.
EDIT: If you use axis=-1 it will sum over the last axis of the array, so you can use a 2-D array of vectors, for example.
You can use np.linalg.norm too.
def euclidean_dist(vector_x, vector_y):
    distances = np.linalg.norm(np.array(vector_x)-np.array(vector_y), axis=1)
    return distances.tolist()
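As a cross-check that does not need NumPy, the per-landmark distances for the data in the question can be computed with `math.dist` from the standard library (this assumes Python 3.8 or later). Averaging them into a single score is only one possible way to compare two faces and is an assumption here, not something the answers above prescribe:

```python
import math

landmark_1 = [[1828, 911], [1887, 913], [1841, 942], [1832, 974], [1876, 976]]
landmark_2 = [[1827, 928], [1887, 926], [1848, 963], [1836, 992], [1884, 990]]

# One Euclidean distance per pair of corresponding landmarks.
per_point = [math.dist(p, q) for p, q in zip(landmark_1, landmark_2)]
print([round(d, 2) for d in per_point])

# Hypothetical single score: the mean of the per-landmark distances.
mean_distance = sum(per_point) / len(per_point)
print(round(mean_distance, 2))
```

Each distance comes out between 13 and about 22 pixels, far below the 1424.94 reported in the question, which is consistent with the double squaring being the cause of the large value.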

Nested for loop producing more number of values than expected-Python

Background: I have two catalogues consisting of positions of spatial objects. My aim is to find the objects that appear in both catalogues, allowing a maximum difference in angular distance of a certain value. One of them is called bss and the other is called super.
Here is the full code I wrote
import numpy as np
def crossmatch(bss_cat, super_cat, max_dist):
matches=[]
no_matches=[]
def find_closest(bss_cat,super_cat):
dist_list=[]
def angular_dist(ra1, dec1, ra2, dec2):
r1 = np.radians(ra1)
d1 = np.radians(dec1)
r2 = np.radians(ra2)
d2 = np.radians(dec2)
a = np.sin(np.abs(d1-d2)/2)**2
b = np.cos(d1)*np.cos(d2)*np.sin(np.abs(r1 - r2)/2)**2
rad = 2*np.arcsin(np.sqrt(a + b))
d = np.degrees(rad)
return d
for i in range(len(bss_cat)): #The problem arises here
for j in range(len(super_cat)):
distance = angular_dist(bss_cat[i][1], bss_cat[i][2], super_cat[j][1], super_cat[j][2]) #While this is supposed to produce single floating point values, it produces numpy.ndarray consisting of three entries
dist_list.append(distance) #This list now contains numpy.ndarrays instead of numpy.float values
for k in range(len(dist_list)):
if dist_list[k] < max_dist:
element = (bss_cat[i], super_cat[j], dist_list[k])
matches.append(element)
else:
element = bss_cat[i]
no_matches.append(element)
return (matches,no_matches)
When run separately, the function angular_dist(ra1, dec1, ra2, dec2) produces a single numpy.float value as expected. But when used inside the for loop in this crossmatch(bss_cat, super_cat, max_dist) function, it produces numpy.ndarrays instead of numpy.float. I've stated this inside the code as well. I don't know where the code goes wrong. Please help.
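For reference, here is a scalar-only sketch of the same haversine formula using the standard `math` module. It does not diagnose the catalogue layout, which the question does not show; it only provides known values to compare against. Since the `math` functions expect plain numbers, a multi-element array slipping in through the indexing should fail loudly here instead of silently producing an array.

```python
import math

def angular_dist(ra1, dec1, ra2, dec2):
    """Angular distance in degrees between two (RA, Dec) positions given in degrees."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = math.sin(abs(d1 - d2) / 2) ** 2
    b = math.cos(d1) * math.cos(d2) * math.sin(abs(r1 - r2) / 2) ** 2
    return math.degrees(2 * math.asin(math.sqrt(a + b)))

# Identical positions are zero degrees apart.
print(angular_dist(10.0, 20.0, 10.0, 20.0))  # 0.0

# A quarter turn along the equator should be close to 90 degrees.
print(angular_dist(0.0, 0.0, 90.0, 0.0))
```

Printing `type(bss_cat[i][1])` inside the loop and comparing it with the scalars used here may help locate where the three-element arrays come from.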

Determining if string B could be the result of a deletion/addition to string A

Assume I have a base string of length 29, and a list of arbitrary strings of lengths 28 and 30. How would I determine the number of these strings which could be the result of a deletion/addition of one character performed on the base string?
I'm doing this in Python, for the record.
Let's see... I would modify the Levenshtein distance algorithm (Python code here) to make it work only in case of addition or deletion of one character.
from functools import partial
# my_distances is the custom module holding the modified Levenshtein distance (not shown here)
from my_distances import add_delete_distance
def is_accepted(base_string, alternative_string):
    '''It uses the custom distance algorithm to evaluate (boolean output) if a
    particular alternative string is ok with respect to the base string.'''
    assert type(alternative_string) == str
    len_difference = abs(len(base_string)-len(alternative_string))
    if len_difference == 1 :
        distance = add_delete_distance(base_string, alternative_string)
        if distance == 1:
            return True
    return False
base_string = 'michele'
alternative_strings = ['michel', 'michelle', 'james', 'michela']
print(list(filter(partial(is_accepted, base_string), alternative_strings)))
What do you think about it?
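Since `add_delete_distance` is not shown above, here is one possible self-contained sketch that skips the Levenshtein machinery and checks directly whether one string is the other with a single character inserted. The function names `differs_by_one_insertion` and `is_single_add_or_delete` are hypothetical:

```python
def differs_by_one_insertion(shorter, longer):
    """True if `longer` equals `shorter` with exactly one extra character."""
    if len(longer) - len(shorter) != 1:
        return False
    # Skip the common prefix, then the remainders must match
    # once the extra character in `longer` is dropped.
    i = 0
    while i < len(shorter) and shorter[i] == longer[i]:
        i += 1
    return shorter[i:] == longer[i + 1:]

def is_single_add_or_delete(base, candidate):
    """True if `candidate` is `base` with one character added or deleted."""
    if len(candidate) > len(base):
        return differs_by_one_insertion(base, candidate)
    return differs_by_one_insertion(candidate, base)

base_string = 'michele'
alternative_strings = ['michel', 'michelle', 'james', 'michela']
accepted = [s for s in alternative_strings if is_single_add_or_delete(base_string, s)]
print(accepted)       # ['michel', 'michelle']
print(len(accepted))  # 2
```

'michela' is rejected because it has the same length as the base string (a substitution, not an addition or deletion), and 'james' because its length differs by two.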

Recursive function to search through a dictionary linked to parallel arrays?

I am currently working on a recursive function to search through a dictionary that represents locations along a river. The dictionary indexes 4 parallel arrays using start as the key.
Parallel Arrays:
start = location of the endpoint with the smaller flow accumulation,
end = location of the other endpoint (with the larger flow accumulation),
length = segment length, and;
shape = actual shape, oriented to run from start to end.
Dictionary:
G = {}
for (st,sh,le,en) in zip(start,shape,length,end):
    G[st] = (sh,le,en)
My goal is to search down the river from one of its start points, represented by p, and select locations at 2000-metre intervals (represented by x) until the end. This is the recursive function I'm working on in Python:
def Downstream (p, x, G):
... e = G[p]
... if (IsNull(e)):
... return ("Not Found")
... if (x < 0):
... return ("Invalid")
... if (x < e.length):
... return (Along (e.shape, x))
... return (Downstream (e.end, x-e.length,G))
Currently when I enter Downstream ("(1478475.0, 12065385.0)", 2000, G) it returns a keyerror. I have checked key in G and the key returns false, but when I search G.keys () it returns all keys represented by start including the ones that give false.
For example a key is (1478475.0, 12065385.0). I've used this key as text and a tuple of 2 double values and keyerror returned both times.
Error:
Runtime error
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "<string>", line 1, in Downstream
KeyError: (1478475.0, 12065385.0)
What is causing the keyerror and how can I solve this issue to reach my goal?
I am using Python in ArcGIS as this is using an attribute table from a shapefile of polylines and this is my first attempt at using a recursive function.
This question and answer is how I've reached this point in organizing my data and writing this recursive function.
https://gis.stackexchange.com/questions/87649/select-points-approx-2000-metres-from-another-point-along-a-river
Examples:
>>> G.keys ()
[(1497315.0, 11965605.0), (1502535.0, 11967915.0), (1501785.0, 11968665.0)...
>>> print G
{(1497315.0, 11965605.0): (([1499342.3515172896, 11967472.92330054],), (7250.80302528,), (1501785.0, 11968665.0)), (1502535.0, 11967915.0): (([1502093.6057616705, 11968248.26139775],), (1218.82250994,), (1501785.0, 11968665.0)),...
Your function is not working for five main reasons:
The syntax is off - indentation is important in Python;
You don't check whether p in G each time Downstream gets called (the first key may be present, but what about later recursive calls?);
You have too many return points, for example your last line will never run (what should the function output be?);
You seem to be accessing a 3-tuple e = (sh, le, en) by attribute (e.end); and
You are subtracting from the interval length when you call recursively, so the function can't keep track of how far apart points should be - you need to separate the interval and the offset from the start.
Instead, I think you need something like (untested!):
def Downstream(start, interval, data, offset=0, out=None):
    if out is None:
        out = []
    if start in data:
        shape, length, end = data[start]
        length = length[0]
        if interval < length:
            distance = offset
            while distance < length:
                out.append(Along(shape, distance))
                distance += interval
            distance -= interval
            Downstream(end, interval, data, interval - (length - distance), out)
        else:
            Downstream(end, interval, data, offset - length, out)
    return out
This will give you a list of whatever Along returns. If the original start is not in data, it will return an empty list.
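Two of the points above, checking membership before indexing and unpacking the 3-tuple instead of using attributes, can be illustrated on a two-entry excerpt of the dictionary printed in the question. This is only a sketch; the excerpt is shortened, so the lookup results apply to it and not to the full data:

```python
# Two entries copied from the printed G in the question: key -> (shape, length, end)
G = {
    (1497315.0, 11965605.0): (([1499342.3515172896, 11967472.92330054],),
                              (7250.80302528,),
                              (1501785.0, 11968665.0)),
    (1502535.0, 11967915.0): (([1502093.6057616705, 11968248.26139775],),
                              (1218.82250994,),
                              (1501785.0, 11968665.0)),
}

key = (1497315.0, 11965605.0)
print(key in G)       # True: a tuple of two floats matches the stored key
print(str(key) in G)  # False: the text "(1497315.0, 11965605.0)" is a different object

# The values are plain tuples, so unpack them rather than writing e.length or e.end.
shape, length, end = G[key]
print(length[0])      # 7250.80302528 (length is itself a 1-tuple)

# The end point need not be a key, so test with `in` before recursing.
print(end in G)       # False in this two-entry excerpt
```

Without the `in` check, the last lookup would raise the same kind of KeyError as in the question.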
