How do I implement this similarity measure in Python?

How do I implement this similarity measure in Python? - python

I tried implementing the distance measure shown in the image, in Python as such:
import numpy as np
A = [1, 2, 3, 4, 5, 6, 7, 8, 1]
B = [1, 2, 3, 2, 4, 6, 7, 8, 2]
A = np.asarray(A).flatten()
B = np.asarray(B).flatten()
x = np.sum(1 - np.divide((1 + np.minimum(A, B)), (1 + np.maximum(A, B))))
print("Distance: {}".format(x))
but after testing, it doesn't seem to be the right approach. The maximum value returned if there's no similarity at all between the given vectors should be 1, with 0 as perfect similiarity. A and B in the image are both vectors with size m.
Edit: forgot to add that I ignored the part for min(A, B) < 0 as that wont ever happen for my intentions

This should work. First, we create a matrix AB by stacking the columns and calculate the minimum vector AB_min and maximum vector AB_max out of that. Then, we compute D as you defined it, making use of numpy.where to specify the two conditions. After that, we sum the elements to get the D_proposed as you defined it. It gives a value of 0.9 for this example.
import numpy as np
A = [1, 2, 3, 4, 5, 6, 7, 8, 1]
B = [1, 2, 3, 2, 4, 6, 7, 8, 2]
AB = np.column_stack((A,B))
AB_min = np.min(AB,1)
AB_max = np.max(AB,1)
print AB_min
print AB_max
D = np.where(AB_min >= 0.,\
1. - (1. + AB_min) / (1. + AB_max),\
1. - (1. + AB_min + abs(AB_min)) / (1. + AB_max + abs(AB_min)))
print D
D_proposed = np.sum(D)
print D_proposed

Related

Python: Find outliers inside a list

I'm having a list with a random amount of integers and/or floats. What I'm trying to achieve is to find the exceptions inside my numbers (hoping to use the right words to explain this). For example:
list = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]
90 to 99% of my integer values are between 1 and 20
sometimes there are values that are much higher, let's say somewhere around 100 or 1.000 or even more
My problem is, that these values can be different all the time. Maybe the regular range is somewhere between 1.000 to 1.200 and the exceptions are in the range of half a million.
Is there a function to filter out these special numbers?

Assuming your list is l:
If you know you want to filter a certain percentile/quantile, you can
use:
This removes bottom 10% and top 90%. Of course, you can change any of
them to your desired cut-off (for example you can remove the bottom filter and only filter the top 90% in your example):
import numpy as np
l = np.array(l)
l = l[(l>np.quantile(l,0.1)) & (l<np.quantile(l,0.9))].tolist()
output:
[ 3 2 14 2 8 4 3 5]
If you are not sure of the percentile cut-off and are looking to
remove outliers:
You can adjust your cut-off for outliers by adjusting argument m in
function call. The larger it is, the less outliers are removed. This function seems to be more robust to various types of outliers compared to other outlier removal techniques.
import numpy as np
l = np.array(l)
def reject_outliers(data, m=6.):
d = np.abs(data - np.median(data))
mdev = np.median(d)
s = d / (mdev if mdev else 1.)
return data[s < m].tolist()
print(reject_outliers(l))
output:
[1, 3, 2, 14, 2, 1, 8, 1, 4, 3, 5]

You can use the built-in filter() method:
lst1 = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]
lst2 = list(filter(lambda x: x > 5,lst1))
print(lst2)
Output:
[14, 108, 8, 97]

So here is a method how to block out those deviators
import math
_list = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]
def consts(_list):
mu = 0
for i in _list:
mu += i
mu = mu/len(_list)
sigma = 0
for i in _list:
sigma += math.pow(i-mu,2)
sigma = math.sqrt(sigma/len(_list))
return sigma, mu
def frequence(x, sigma, mu):
return (1/(sigma*math.sqrt(2*math.pi)))*math.exp(-(1/2)*math.pow(((x-mu)/sigma),2))
sigma, mu = consts(_list)
new_list = []
for i in range(len(_list)):
if frequence(_list[i], sigma, mu) > 0.01:
new_list.append(i)
print(new_list)

Efficient way of getting average area in numpy

Is there a more efficient way in determining the averages of a certain area in a given numpy array? For simplicity, lets say I have a 5x5 array:
values = np.array([[0, 1, 2, 3, 4],
[1, 2, 3, 4, 5],
[2, 3, 4, 5, 6],
[3, 4, 5, 6, 7],
[4, 5, 6, 7, 8]])
I would like to get the averages of each coordinate, with a specified area size, assuming the array wraps around. Lets say the certain area is size 2, thus anything around a certain point within distance 2 will be considered. For example, to get the average of the area from coordinate (2,2), we need to consider
2,
2, 3, 4,
2, 3, 4, 5, 6
4, 5, 6,
6,
Thus, the average will be 4.
For coordinate (4, 4) we need to consider:
6,
6, 7, 3,
6, 7, 8, 4, 5
3, 4, 0,
5,
Thus the average will be 4.92.
Currently, I have the following code below. But since I have a for loop I feel like it could be improved. Is there a way to just use numpy built in functions?
Is there a way to use np.vectorize to gather the subarrays (area), place it all in an array, then use np.einsum or something.
def get_average(matrix, loc, dist):
sum = 0
num = 0
size, size = matrix.shape
for y in range(-dist, dist + 1):
for x in range(-dist + abs(y), dist - abs(y) + 1):
y_ = (y + loc.y) % size
x_ = (x + loc.x) % size
sum += matrix[y_, x_]
num += 1
return sum/num
class Coord():
def __init__(self, x, y):
self.x = x
self.y = y
values = np.array([[0, 1, 2, 3, 4],
[1, 2, 3, 4, 5],
[2, 3, 4, 5, 6],
[3, 4, 5, 6, 7],
[4, 5, 6, 7, 8]])
height, width = values.shape
averages = np.zeros((height, width), dtype=np.float16)
for r in range(height):
for c in range(width):
loc = Coord(c, r)
averages[r][c] = get_average(values, loc, 2)
print(averages)
Output:
[[ 3.07617188 2.92382812 3.5390625 4.15234375 4. ]
[ 2.92382812 2.76953125 3.38476562 4. 3.84570312]
[ 3.5390625 3.38476562 4. 4.6171875 4.4609375 ]
[ 4.15234375 4. 4.6171875 5.23046875 5.078125 ]
[ 4. 3.84570312 4.4609375 5.078125 4.921875 ]]

This solution is less efficient (slower) than yours but is just an example using numpy.ma module.
Required libraries:
import numpy as np
import numpy.ma as ma
Define methods to do the job:
# build the shape of the area as a rhomboid
def rhomboid2(dim):
size = 2*dim + 1
matrix = np.ones((size,size))
for y in range(-dim, dim + 1):
for x in range(-dim + abs(y), dim - abs(y) + 1):
matrix[(y + dim) % size, (x + dim) % size] = 0
return matrix
# build a mask using the area shaped
def mask(matrix_shape, rhom_dim):
mask = np.zeros(matrix_shape)
bound = 2*rhom_dim+1
rhom = rhomboid2(rhom_dim)
mask[0:bound, 0:bound] = rhom
# roll to set the position of the rhomboid to 0,0
mask = np.roll(mask,-rhom_dim, axis = 0)
mask = np.roll(mask,-rhom_dim, axis = 1)
return mask
Then, iterate to build the result:
mask_ = mask((5,5), 2) # call the mask sized as values array with a rhomboid area of size 2
averages = np.zeros_like(values, dtype=np.float16) # initialize the recipient
# iterate over the mask to calculate the average
for y in range(len(mask_)):
for x in range(len(mask_)):
masked = ma.array(values, mask = mask_)
averages[y,x] = np.mean(masked)
mask_ = np.roll(mask_, 1, axis = 1)
mask_ = np.roll(mask_, 1, axis = 0)
Which returns
# [[3.076 2.924 3.54 4.152 4. ]
# [2.924 2.77 3.385 4. 3.846]
# [3.54 3.385 4. 4.617 4.46 ]
# [4.152 4. 4.617 5.23 5.08 ]
# [4. 3.846 4.46 5.08 4.92 ]]

constructing arithmetic progressions from loop

I am trying to work out a program that would calculate the diagonal coefficients of pascal's triangle.
For those who are not familiar with it, the general terms of sequences are written below.
1st row = 1 1 1 1 1....
2nd row = N0(natural number) // 1 = 1 2 3 4 5 ....
3rd row = N0(N0+1) // 2 = 1 3 6 10 15 ...
4th row = N0(N0+1)(N0+2) // 6 = 1 4 10 20 35 ...
the subsequent sequences for each row follows a specific pattern and it is my goal to output those sequences in a for loop with number of units as input.
def figurate_numbers(units):
row_1 = str(1) * units
row_1_list = list(row_1)
for i in range(1, units):
sequences are
row_2 = n // i
row_3 = (n(n+1)) // (i(i+1))
row_4 = (n(n+1)(n+2)) // (i(i+1)(i+2))
>>> def figurate_numbers(4): # coefficients for 4 rows and 4 columns
[1, 1, 1, 1]
[1, 2, 3, 4]
[1, 3, 6, 10]
[1, 4, 10, 20] # desired output
How can I iterate for both n and i in one loop such that each sequence of corresponding row would output coefficients?

You can use map or a list comprehension to hide a loop.
def f(x, i):
return lambda x: ...
row = [ [1] * k ]
for i in range(k):
row[i + 1] = map( f(i), row[i])
where f is function that descpribe the dependency on previous element of row.
Other possibility adapt a recursive Fibbonachi to rows. Numpy library allows for array arifmetics so even do not need map. Also python has predefined libraries for number of combinations etc, perhaps can be used.
To compute efficiently, without nested loops, use Rational Number based solution from
https://medium.com/#duhroach/fast-fun-with-pascals-triangle-6030e15dced0 .
from fractions import Fraction
def pascalIndexInRowFast(row,index):
lastVal=1
halfRow = (row>>1)
#early out, is index < half? if so, compute to that instead
if index > halfRow:
index = halfRow - (halfRow - index)
for i in range(0, index):
lastVal = lastVal * (row - i) / (i + 1)
return lastVal
def pascDiagFast(row,length):
#compute the fractions of this diag
fracs=[1]*(length)
for i in range(length-1):
num = i+1
denom = row+1+i
fracs[i] = Fraction(num,denom)
#now let's compute the values
vals=[0]*length
#first figure out the leftmost tail of this diag
lowRow = row + (length-1)
lowRowCol = row
tail = pascalIndexInRowFast(lowRow,lowRowCol)
vals[-1] = tail
#walk backwards!
for i in reversed(range(length-1)):
vals[i] = int(fracs[i]*vals[i+1])
return vals

Don't reinvent the triangle:
>>> from scipy.linalg import pascal
>>> pascal(4)
array([[ 1, 1, 1, 1],
[ 1, 2, 3, 4],
[ 1, 3, 6, 10],
[ 1, 4, 10, 20]], dtype=uint64)
>>> pascal(4).tolist()
[[1, 1, 1, 1], [1, 2, 3, 4], [1, 3, 6, 10], [1, 4, 10, 20]]

Single line chunk re-assignment

As shown in the following code, I have a chunk list x and the full list h. I want to reassign back the values stored in x in the correct positions of h.
index = 0
for t1 in range(lbp, ubp):
h[4 + t1] = x[index]
index = index + 1
Does anyone know how to write it in a single line/expression?
Disclaimer: This is part of a bigger project and I simplified the questions as much as possible. You can expect the matrix sizes to be correct but if you think I am missing something please ask for it. For testing you can use the following variable values:
h = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
x = [20, 21]
lbp = 2
ubp = 4

You can use slice assignment to expand on the left-hand side and assign your x list directly to the indices of h, e.g.:
h = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
x = [20, 21]
lbp = 2
ubp = 4
h[4 + lbp:4 + ubp] = x # or better yet h[4 + lbp:4 + lbp + len(x)] = x
print(h)
# [1, 2, 3, 4, 5, 6, 20, 21, 9, 10]
I'm not really sure why are you adding 4 to the indexes in your loop nor what lbp and ubp are supposed to mean, tho. Keep in mind that when you select a range like this, the list you're assigning to the range has to be of the same length as the range.

Finding the point of a slope change as a free parameter- Python

Say I have two lists of data as follows:
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 2, 3, 4, 5, 6, 8, 10, 12, 14]
That is, it's pretty clear that merely fitting a line to this data doesn't work, but instead the slope changed at a point in the data. (Obviously, one can pinpoint from this data set pretty easily where that change is, but it's not as clear in the set I'm working with so let's ignore that.) Something with the derivative, I'm guessing, but the point here is I want to treat this as a free parameter where I say "it's this point, +/- this uncertainty, and here is the linear slope before and after this point."
Note, I can do this with an array if it's easier. Thanks!

Here is a plot of your data:
You need to find two slopes (== taking two derivatives). First, find the slope between every two points (using numpy):
import numpy as np
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10],dtype=np.float)
y = np.array([1, 2, 3, 4, 5, 6, 8, 10, 12, 14],dtype=np.float)
m = np.diff(y)/np.diff(x)
print (m)
# [ 1. 1. 1. 1. 1. 2. 2. 2. 2.]
Clearly, slope changes from 1 to 2 in the sixth interval (between sixth and seventh points). Then take the derivative of this array, which tells you when the slope changes:
print (np.diff(m))
[ 0. 0. 0. 0. 1. 0. 0. 0.]
To find the index of the non-zero value:
idx = np.nonzero(np.diff(m))[0]
print (idx)
# 4
Since we took one derivative with respect to x, and indices start from zero in Python, idx+2 tells you that the slope is different before and after the sixth point.

I'm not sure to understand very well what you want but you can see the evolution this way (derivative):
>>> y = [1, 2, 3, 4, 5, 6, 8, 10, 12, 14]
>>> dy=[y[i+1]-y[i] for i in range(len(y)-1)]
>>> dy
[1, 1, 1, 1, 1, 2, 2, 2, 2]
and then find the point where it change (second derivative):
>>> dpy=[dy[i+1]-dy[i] for i in range(len(dy)-1)]
>>> dpy
[0, 0, 0, 0, 1, 0, 0, 0]
if you want the index of this point :
>>> dpy.index(1)
4
that can give you the value of the last point before change of slope :
>>> change=dpy.index(1)
>>> y[change]
5
In your y = [1, 2, 3, 4, 5, 6, 8, 10, 12, 14] the change happen at the index [4] (list indexing start to 0) and the value of y at this point is 5.

You can calculate the slope as the difference between each pair of points (the first derivative). Then check where the slope changes (the second derivative). If it changes, append the index location to idx, the collection of points where the slope changes.
Note that the first point does not have a unique slope. The second pair of points will give you the slope, but you need the third pair before you can measure the change in slope.
idx = []
prior_slope = float(y[1] - y[0]) / (x[1] - x[0])
for n in range(2, len(x)): # Start from 3rd pair of points.
slope = float(y[n] - y[n - 1]) / (x[n] - x[n - 1])
if slope != prior_slope:
idx.append(n)
prior_slope = slope
>>> idx
[6]
Of course this could be done more efficiently in Pandas or Numpy, but I am just giving you a simple Python 2 solution.
A simple conditional list comprehension should also be pretty efficient, although it is more difficult to understand.
idx = [n for n in range(2, len(x))
if float(y[n] - y[n - 1]) / (x[n] - x[n - 1])
!= float(y[n - 1] - y[n - 2]) / (x[n - 1] - x[n - 2])]

Knee point might be a potential solution.
from kneed import KneeLocator
import numpy as np
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([1, 2, 3, 4, 5, 6, 8, 10, 12, 14])
kn = KneeLocator(x, y, curve='convex', direction='increasing')
# You can use array y to automatically determine 'convex' and 'increasing' if y is well-behaved
idx = (np.abs(x - kn.knee)).argmin()
>>> print(x[idx], y[idx])
6 6

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How do I implement this similarity measure in Python? - python

Related

Python: Find outliers inside a list

Efficient way of getting average area in numpy

constructing arithmetic progressions from loop

Single line chunk re-assignment

Finding the point of a slope change as a free parameter- Python

Categories

Resources