Efficient way of getting average area in numpy - python

Is there a more efficient way of determining the averages of a certain area in a given numpy array? For simplicity, let's say I have a 5x5 array:
values = np.array([[0, 1, 2, 3, 4],
                   [1, 2, 3, 4, 5],
                   [2, 3, 4, 5, 6],
                   [3, 4, 5, 6, 7],
                   [4, 5, 6, 7, 8]])
I would like to get the average of the area around each coordinate, with a specified area size, assuming the array wraps around. Let's say the area size is 2, so anything within distance 2 of a given point is considered. For example, to get the average of the area around coordinate (2,2), we need to consider
      2,
   2, 3, 4,
2, 3, 4, 5, 6,
   4, 5, 6,
      6,
Thus, the average will be 4.
For coordinate (4, 4) we need to consider:
      6,
   6, 7, 3,
6, 7, 8, 4, 5,
   3, 4, 0,
      5,
Thus, the average will be 4.92.
Currently, I have the code below, but since it has nested for loops I feel it could be improved. Is there a way to do this with just numpy built-in functions? For instance, is there a way to use np.vectorize to gather the subarrays (areas) into one array and then apply np.einsum or something similar?
def get_average(matrix, loc, dist):
    total = 0
    num = 0
    size, _ = matrix.shape  # assumes a square matrix
    for y in range(-dist, dist + 1):
        for x in range(-dist + abs(y), dist - abs(y) + 1):
            # wrap around the edges
            y_ = (y + loc.y) % size
            x_ = (x + loc.x) % size
            total += matrix[y_, x_]
            num += 1
    return total / num
class Coord():
    def __init__(self, x, y):
        self.x = x
        self.y = y
values = np.array([[0, 1, 2, 3, 4],
                   [1, 2, 3, 4, 5],
                   [2, 3, 4, 5, 6],
                   [3, 4, 5, 6, 7],
                   [4, 5, 6, 7, 8]])
height, width = values.shape
averages = np.zeros((height, width), dtype=np.float16)
for r in range(height):
    for c in range(width):
        loc = Coord(c, r)
        averages[r][c] = get_average(values, loc, 2)
print(averages)
Output:
[[ 3.07617188  2.92382812  3.5390625   4.15234375  4.        ]
 [ 2.92382812  2.76953125  3.38476562  4.          3.84570312]
 [ 3.5390625   3.38476562  4.          4.6171875   4.4609375 ]
 [ 4.15234375  4.          4.6171875   5.23046875  5.078125  ]
 [ 4.          3.84570312  4.4609375   5.078125    4.921875  ]]

This solution is less efficient (slower) than yours, but it is an example of using the numpy.ma (masked array) module.
Required libraries:
import numpy as np
import numpy.ma as ma
Define methods to do the job:
# build the shape of the area as a rhomboid
def rhomboid2(dim):
    size = 2*dim + 1
    matrix = np.ones((size, size))
    for y in range(-dim, dim + 1):
        for x in range(-dim + abs(y), dim - abs(y) + 1):
            # 0 marks cells inside the rhomboid (unmasked in numpy.ma)
            matrix[(y + dim) % size, (x + dim) % size] = 0
    return matrix
# build a mask using the shaped area
def mask(matrix_shape, rhom_dim):
    mask = np.zeros(matrix_shape)
    bound = 2*rhom_dim + 1
    rhom = rhomboid2(rhom_dim)
    mask[0:bound, 0:bound] = rhom
    # roll to set the position of the rhomboid to (0, 0)
    mask = np.roll(mask, -rhom_dim, axis=0)
    mask = np.roll(mask, -rhom_dim, axis=1)
    return mask
Then, iterate to build the result:
mask_ = mask((5, 5), 2)  # mask sized like the values array, rhomboid area of size 2
averages = np.zeros_like(values, dtype=np.float16)  # initialize the recipient
# iterate over the mask to calculate the average at each position
for y in range(len(mask_)):
    for x in range(len(mask_)):
        masked = ma.array(values, mask=mask_)
        averages[y, x] = np.mean(masked)
        mask_ = np.roll(mask_, 1, axis=1)
    mask_ = np.roll(mask_, 1, axis=0)
Which returns
# [[3.076 2.924 3.54 4.152 4. ]
# [2.924 2.77 3.385 4. 3.846]
# [3.54 3.385 4. 4.617 4.46 ]
# [4.152 4. 4.617 5.23 5.08 ]
# [4. 3.846 4.46 5.08 4.92 ]]
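For completeness, here is a hedged sketch of a fully vectorized alternative (my own addition, not part of the answer above): since the neighborhood is the same diamond shape for every coordinate, you can sum one np.roll-shifted copy of the whole array per offset, which handles the wrap-around for free, and divide once at the end. The loop then runs over the 13 offsets of the diamond instead of over every cell:
import numpy as np

def diamond_average(matrix, dist):
    # sum one rolled copy of the array per offset inside the diamond;
    # np.roll wraps around the edges, matching the modulo indexing above
    total = np.zeros(matrix.shape, dtype=float)
    count = 0
    for dy in range(-dist, dist + 1):
        for dx in range(-dist + abs(dy), dist - abs(dy) + 1):
            total += np.roll(matrix, (-dy, -dx), axis=(0, 1))
            count += 1
    return total / count

# diamond_average(values, 2) reproduces the averages printed above (in float64)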

Related

Find jaccard similarity among a batch of vectors in PyTorch

I have a batch of vectors of shape (bs, m, n) (i.e., bs matrices of dimensions m x n). For each batch, I would like to calculate the Jaccard similarity of the first vector with the remaining (m-1) vectors.
Example:
a = [
    [[3, 8, 6, 8, 7],
     [9, 7, 4, 8, 1],
     [7, 8, 8, 5, 7],
     [3, 9, 9, 4, 4]],
    [[7, 3, 8, 1, 7],
     [3, 0, 3, 4, 2],
     [9, 1, 6, 1, 6],
     [2, 7, 0, 6, 6]]
]
Find the pairwise Jaccard similarity between a[:,0,:] and a[:,1:,:]
i.e.,
[3, 8, 6, 8, 7] with each of [[9, 7, 4, 8, 1], [7, 8, 8, 5, 7], [3, 9, 9, 4, 4]] (3 scores)
and
[7, 3, 8, 1, 7] with each of [[3, 0, 3, 4, 2], [9, 1, 6, 1, 6], [2, 7, 0, 6, 6]] (3 scores)
Here's the Jaccard function I have tried:
import torch

def js(la1, la2):
    combined = torch.cat((la1, la2))
    union, counts = combined.unique(return_counts=True)
    intersection = union[counts > 1]
    return torch.numel(intersection) / torch.numel(union)
While this works with unequal-sized tensors, the problem with this approach is that the number of uniques in each combination (pair of tensors) might be different, and since PyTorch doesn't support jagged tensors, I'm unable to process batches of vectors at once.
If I'm not able to express the problem with the expected clarity, do let me know. Any help in this regard would be greatly appreciated
EDIT: Here's the flow, achieved by iterating over the 1st and 2nd dimensions. I wish to have a vectorised version of the code below for batch processing:
bs = 2
m = 4
n = 5
a = torch.randint(0, 10, (bs, m, n))
print(f"Array is: \n{a}")
for bs_idx in range(bs):
    first = a[bs_idx, 0, :]
    for row in range(1, m):
        second = a[bs_idx, row, :]
        idx = js(first, second)
        print(f'comparing {first} and {second}: {idx}')
I don't know how you could achieve this in pytorch, since AFAIK pytorch doesn't support set operations on tensors. In your js() implementation, the union calculation should work, but intersection = union[counts > 1] doesn't give you the right result if one of the tensors contains duplicated values. Numpy, on the other hand, has built-in support with union1d and intersect1d. You can use numpy vectorization to calculate pairwise Jaccard indices without using for-loops:
import numpy as np

def num_intersection(vec1: np.ndarray, vec2: np.ndarray) -> int:
    return np.intersect1d(vec1, vec2, assume_unique=False).size

def num_union(vec1: np.ndarray, vec2: np.ndarray) -> int:
    return np.union1d(vec1, vec2).size

def jaccard1d(vec1: np.ndarray, vec2: np.ndarray) -> float:
    assert vec1.ndim == vec2.ndim == 1 and vec1.shape[0] == vec2.shape[0], 'vec1 and vec2 must be 1D arrays of equal length'
    return num_intersection(vec1, vec2) / num_union(vec1, vec2)

jaccard2d = np.vectorize(jaccard1d, signature='(m),(n)->()')
def jaccard(vecs1: np.ndarray, vecs2: np.ndarray) -> np.ndarray:
    """
    Return intersection-over-union (Jaccard index) between two sets of vectors.

    Both sets of vectors are expected to be flattened to 2D, where dim 0 is the batch
    dimension and dim 1 contains the flattened vectors of length V (the Jaccard index
    of an n-dimensional vector and of its flattened 1D vector is equal).

    Args:
        vecs1 (ndarray[N, V]): first set of vectors
        vecs2 (ndarray[M, V]): second set of vectors

    Returns:
        ndarray[N, M]: the NxM matrix containing the pairwise Jaccard indices for every vector in vecs1 and vecs2
    """
    assert vecs1.ndim == vecs2.ndim == 2 and vecs1.shape[1] == vecs2.shape[1], 'vecs1 and vecs2 must be 2D arrays with equal length in axis 1'
    return jaccard2d(vecs1, vecs2)
This is of course suboptimal because the code doesn't run on the GPU. If I run the jaccard function with vecs1 of shape (1, 10) and vecs2 of shape (10_000, 10), I get a mean loop time of 200 ms ± 1.34 ms on my machine, which should probably be fast enough for most use cases. And conversion between pytorch tensors and numpy arrays is very cheap.
To apply this function to your problem with array a:
a = torch.tensor(a).numpy()  # just to demonstrate
ious = [jaccard(batch[:1, :], batch[1:, :]) for batch in a]
np.array(ious).squeeze()  # 2 batches with 3 scores each -> 2x3 matrix
# array([[0.28571429, 0.4       , 0.16666667],
#        [0.14285714, 0.16666667, 0.14285714]])
Use torch.from_numpy() on the result to get a pytorch tensor again if needed.
Update:
If you need a pytorch version for calculating the Jaccard index, I partially implemented numpy's intersect1d in torch:
import torch
from torch import Tensor

def torch_intersect1d(t1: Tensor, t2: Tensor, assume_unique: bool = False) -> Tensor:
    if t1.ndim > 1:
        t1 = t1.flatten()
    if t2.ndim > 1:
        t2 = t2.flatten()
    if not assume_unique:
        t1 = t1.unique(sorted=True)
        t2 = t2.unique(sorted=True)
    # generate an m x n comparison matrix where m is numel(t1) and n is numel(t2),
    # then keep the elements of t1 that appear anywhere in t2
    intersect = t1[(t1.view(-1, 1) == t2.view(1, -1)).any(dim=1)]
    if not assume_unique:
        intersect = intersect.sort().values
    return intersect

def torch_union1d(t1: Tensor, t2: Tensor) -> Tensor:
    return torch.cat((t1.flatten(), t2.flatten())).unique()

def torch_jaccard1d(t1: Tensor, t2: Tensor) -> float:
    return torch_intersect1d(t1, t2).numel() / torch_union1d(t1, t2).numel()
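As a quick sanity check (my example, not part of the original answer), this reproduces the first pairwise score from the numpy version above:
t1 = torch.tensor([3, 8, 6, 8, 7])
t2 = torch.tensor([9, 7, 4, 8, 1])
print(torch_jaccard1d(t1, t2))  # {7, 8} shared, 7 unique values in total -> 0.2857...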
To vectorize the torch_jaccard1d function, you might want to look into torch.vmap, which lets you vectorize a function over an arbitrary batch dimension (similar to numpy's vectorize). The vmap function is a prototype feature and not yet available in the usual pytorch distributions, but you can get it using nightly builds of pytorch. I haven't tested it but this might work.
You can use numpy.unique, numpy.intersect1d, and numpy.union1d, and adapt the function recommended here, to compute the Jaccard similarity in Python:
import torch
import numpy as np

def jaccard_similarity_numpy(list1, list2):
    s1 = np.unique(list1)
    s2 = np.unique(list2)
    return len(np.intersect1d(s1, s2)) / len(np.union1d(s1, s2))

tns_a = torch.tensor([
    [[3, 8, 6, 8, 7],
     [9, 7, 4, 8, 1],
     [7, 8, 8, 5, 7],
     [3, 9, 9, 4, 4]],
    [[7, 3, 8, 1, 7],
     [3, 0, 3, 4, 2],
     [9, 1, 6, 1, 6],
     [2, 7, 0, 6, 6]]
])

tns_b = torch.empty(tns_a.shape[0], tns_a.shape[1] - 1)
for i in range(tns_a.shape[0]):
    tns_tmp = tns_a[i][0]
    for j, tns in enumerate(tns_a[i][1:]):
        tns_b[i][j] = jaccard_similarity_numpy(tns_tmp, tns)
print(tns_b)
Output:
tensor([[0.2857, 0.4000, 0.1667],
        [0.1429, 0.1667, 0.1429]])
You didn't tag numba, but if you want a fast approach for computing jaccard_similarity on data with shape (45_000, 110, 12), I highly recommend using numba with parallel=True. For random data with shape (45_000, 110, 12) I get a run time of only 5 s (timed on Colab):
import numpy as np
import numba as nb
import torch

@nb.jit(nopython=True, parallel=True)
def jaccard_Imahdi_Numba(batch_tensor):
    tns_b = np.empty((batch_tensor.shape[0], batch_tensor.shape[1] - 1))
    for i in nb.prange(batch_tensor.shape[0]):
        tns_tmp = batch_tensor[i][0]
        for j, tns in enumerate(batch_tensor[i][1:]):
            s1 = set(tns_tmp)
            s2 = set(tns)
            res = len(s1.intersection(s2)) / len(s1.union(s2))
            tns_b[i][j] = res
    return tns_b

large_tns = torch.tensor(np.random.randint(0, 100, (45_000, 110, 12)))
%timeit jaccard_Imahdi_Numba(large_tns.numpy())
# 1 loop, best of 5: 5.37 s per loop
Below is a benchmark on 50_000 batches of shape (4, 5), i.e. (50_000, 4, 5). We get 116 ms with numba, while the other approaches take 8 s and 9 s (timed on Colab):
import numpy as np
import numba as nb
import torch

def jaccard_asdf(batch_tensor):
    def num_intersection(vec1: np.ndarray, vec2: np.ndarray) -> int:
        return np.intersect1d(vec1, vec2, assume_unique=False).size

    def num_union(vec1: np.ndarray, vec2: np.ndarray) -> int:
        return np.union1d(vec1, vec2).size

    def jaccard1d(vec1: np.ndarray, vec2: np.ndarray) -> float:
        assert vec1.ndim == vec2.ndim == 1 and vec1.shape[0] == vec2.shape[0], 'vec1 and vec2 must be 1D arrays of equal length'
        return num_intersection(vec1, vec2) / num_union(vec1, vec2)

    jaccard2d = np.vectorize(jaccard1d, signature='(m),(n)->()')

    def jaccard(vecs1: np.ndarray, vecs2: np.ndarray) -> np.ndarray:
        assert vecs1.ndim == vecs2.ndim == 2 and vecs1.shape[1] == vecs2.shape[1], 'vecs1 and vecs2 must be 2D arrays with equal length in axis 1'
        return jaccard2d(vecs1, vecs2)

    a = torch.tensor(batch_tensor).numpy()  # just to demonstrate
    ious = [jaccard(batch[:1, :], batch[1:, :]) for batch in a]
    return np.array(ious).squeeze()
def jaccard_Imahdi(batch_tensor):
    def jaccard_similarity_numpy(list1, list2):
        s1 = np.unique(list1)
        s2 = np.unique(list2)
        return len(np.intersect1d(s1, s2)) / len(np.union1d(s1, s2))

    tns_b = np.empty((batch_tensor.shape[0], batch_tensor.shape[1] - 1))
    for i in range(batch_tensor.shape[0]):
        tns_tmp = batch_tensor[i][0]
        for j, tns in enumerate(batch_tensor[i][1:]):
            tns_b[i][j] = jaccard_similarity_numpy(tns_tmp, tns)
    return tns_b
@nb.jit(nopython=True, parallel=True)
def jaccard_Imahdi_Numba(batch_tensor):
    tns_b = np.empty((batch_tensor.shape[0], batch_tensor.shape[1] - 1))
    for i in nb.prange(batch_tensor.shape[0]):
        tns_tmp = batch_tensor[i][0]
        for j, tns in enumerate(batch_tensor[i][1:]):
            s1 = set(tns_tmp)
            s2 = set(tns)
            res = len(s1.intersection(s2)) / len(s1.union(s2))
            tns_b[i][j] = res
    return tns_b
small_tns = torch.tensor([
    [[3, 8, 6, 8, 7],
     [9, 7, 4, 8, 1],
     [7, 8, 8, 5, 7],
     [3, 9, 9, 4, 4]],
    [[7, 3, 8, 1, 7],
     [3, 0, 3, 4, 2],
     [9, 1, 6, 1, 6],
     [2, 7, 0, 6, 6]]
])

print(f'''output jaccard_asdf: \n{
    jaccard_asdf(small_tns)
}''')
print(f'''output jaccard_Imahdi: \n{
    jaccard_Imahdi(small_tns)
}''')
print(f'''output jaccard_Imahdi_Numba: \n{
    jaccard_Imahdi_Numba(small_tns.numpy())
}''')

large_tns = torch.tensor(np.random.randint(0, 100, (50_000, 4, 5)))

%timeit jaccard_Imahdi(large_tns)
# 1 loop, best of 5: 8.32 s per loop
%timeit jaccard_asdf(large_tns)
# 1 loop, best of 5: 9.92 s per loop
%timeit jaccard_Imahdi_Numba(large_tns.numpy())
# 1 loop, best of 5: 116 ms per loop
Output:
output jaccard_asdf:
[[0.28571429 0.4        0.16666667]
 [0.14285714 0.16666667 0.14285714]]
output jaccard_Imahdi:
[[0.28571429 0.4        0.16666667]
 [0.14285714 0.16666667 0.14285714]]
output jaccard_Imahdi_Numba:
[[0.28571429 0.4        0.16666667]
 [0.14285714 0.16666667 0.14285714]]
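One more option, not from the answers above but offered as a hedged sketch: when the entries are small non-negative integers (as in all the examples here), you can replace the set operations with boolean presence masks, which vectorizes over the whole batch and runs on the GPU. The function name and the num_values upper bound are my own assumptions:
import torch

def jaccard_presence(a: torch.Tensor, num_values: int = 100) -> torch.Tensor:
    # a: (bs, m, n) integer tensor with entries in [0, num_values)
    # presence[b, i, v] is True iff value v occurs somewhere in row a[b, i]
    presence = torch.nn.functional.one_hot(a.long(), num_values).sum(dim=2) > 0
    first, rest = presence[:, :1], presence[:, 1:]
    inter = (first & rest).sum(-1).float()  # |first ∩ row| for each other row
    union = (first | rest).sum(-1).float()  # |first ∪ row|
    return inter / union                    # shape (bs, m-1)

# jaccard_presence(tns_a, num_values=10) reproduces the 2x3 score matrix above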

Numpy array: get upper diagonal and lower diagonal for a given element

import numpy
square = numpy.reshape(range(0,16),(4,4))
square
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
In the above array, how do I access the primary diagonal and secondary diagonal of any given element? For example, 9.
By primary diagonal, I mean [4, 9, 14];
by secondary diagonal, I mean [3, 6, 9, 12].
I can't use numpy.diag() because it takes the entire array to get the diagonal.
Based on your description, with np.where, np.diagonal and np.fliplr:
import numpy as np

x, y = np.where(square == 9)
np.diagonal(square, offset=-(x - y))
Out[382]: array([ 4,  9, 14])

x, y = np.where(np.fliplr(square) == 9)
np.diagonal(np.fliplr(square), offset=-(x - y))
Out[396]: array([ 3,  6,  9, 12])
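Putting the two steps together (a sketch of my own based on the answer above; the helper name is made up):
import numpy as np

def diagonals_through(a, value):
    # locate the element (assumes it occurs exactly once)
    r, c = (int(v[0]) for v in np.where(a == value))
    primary = np.diagonal(a, offset=c - r)
    # in the left-right flipped array, (r, c) moves to (r, n-1-c)
    n = a.shape[1]
    secondary = np.diagonal(np.fliplr(a), offset=(n - 1 - c) - r)
    return primary, secondary

# diagonals_through(square, 9) -> (array([ 4,  9, 14]), array([ 3,  6,  9, 12]))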
For the first diagonal, use the fact that both the x-coordinate and the y-coordinate increase by 1 each step:
def first_diagonal(x, y, length_array):
    if x < y:
        return zip(range(x, length_array), range(length_array - x))
    else:
        return zip(range(length_array - y), range(y, length_array))
For the secondary diagonal, use the fact that x_coordinate + y_coordinate is constant:
def second_diagonal(x, y, length_array):
    tot = x + y
    return zip(range(tot + 1), range(tot, -1, -1))
This gives you two lists you can use to access your matrix.
Of course, if you have a non square matrix these functions will have to be reshaped a bit.
To illustrate how to get the desired output:
a = np.reshape(range(0, 16), (4, 4))
first = first_diagonal(1, 2, len(a))
second = second_diagonal(1, 2, len(a))
primary_diagonal = [a[i[0]][i[1]] for i in first]
secondary_diagonal = [a[i[0]][i[1]] for i in second]
print(primary_diagonal)
print(secondary_diagonal)
this outputs:
[4, 9, 14]
[3, 6, 9, 12]

Plotting binary data in python

I have some data that looks like:
data = [1,2,4,5,9] (random pattern of increasing integers)
And I want to plot it in a binary horizontal line so that y=1 for every x value specified in data and zero otherwise.
I have a few different data arrays that I'd like to stack, similar to this style (this is CCD clocking data but the plot format looks ideal)
I think I need to create a list of ones for my data array, but how do I specify the zero value for everything not in the array?
Thanks
You got the point. You can create a list with a 1 in every position specified in data and 0 elsewhere. This can be done very easily with a list comprehension:
def binary_data(data):
    return [1 if x in data else 0 for x in range(data[-1] + 1)]
which will act like this:
>>> data = [1, 2, 4, 5, 9]
>>> bindata = binary_data(data)
>>> bindata
[0, 1, 1, 0, 1, 1, 0, 0, 0, 1]
Now all you have to do is plot it... or better, step it, since it's binary data and step() looks way better:
import numpy as np
from matplotlib.pyplot import step, show

def binary_data(data):
    return [1 if x in data else 0 for x in range(data[-1] + 1)]

data = [1, 2, 4, 5, 9]
bindata = binary_data(data)
xaxis = np.arange(0, data[-1] + 1)
yaxis = np.array(bindata)
step(xaxis, yaxis)
show()
To plot multiple data arrays stacked on the same figure you could tweak binary_data() like this:
def binary_data(data, yshift=0):
    return [yshift+1 if x in data else yshift for x in range(data[-1] + 1)]
so now you can set the yshift parameter to shift data arrays on the y-axis. E.g.,
>>> data = [1, 2, 4, 5, 9]
>>> bindata1 = binary_data(data)
>>> bindata1
[0, 1, 1, 0, 1, 1, 0, 0, 0, 1]
>>> bindata2 = binary_data(data, 2)
>>> bindata2
[2, 3, 3, 2, 3, 3, 2, 2, 2, 3]
Let's say you have data1, data2 and data3 to plot stacked, you'd go like:
import numpy as np
from matplotlib.pyplot import step, show

def binary_data(data, yshift=0):
    return [yshift+1 if x in data else yshift for x in range(data[-1] + 1)]

data1 = [1, 2, 4, 5, 9]
bindata1 = binary_data(data1)
x1 = np.arange(0, data1[-1] + 1)
y1 = np.array(bindata1)

data2 = [1, 4, 9]
bindata2 = binary_data(data2, 2)
x2 = np.arange(0, data2[-1] + 1)
y2 = np.array(bindata2)

data3 = [1, 2, 8, 9]
bindata3 = binary_data(data3, 4)
x3 = np.arange(0, data3[-1] + 1)
y3 = np.array(bindata3)

step(x1, y1, x2, y2, x3, y3)
show()
which you can easily edit to make it work with an arbitrary number of data arrays:
data = [ [1, 2, 4, 5, 9],
         [1, 4, 9],
         [1, 2, 8, 9] ]

for shift, d in enumerate(data):
    bindata = binary_data(d, 2 * shift)
    x = np.arange(0, d[-1] + 1)
    y = np.array(bindata)
    step(x, y)
show()
Finally, if you are dealing with data arrays of different lengths (say [1, 2] and [15, 16]) and you don't like plots that vanish in the middle of the figure, you can tweak binary_data() again to force its range to the maximum range of your data:
import numpy as np
from matplotlib.pyplot import step, show

def binary_data(data, limit, yshift=0):
    return [yshift+1 if x in data else yshift for x in range(limit)]

data = [ [1, 2, 4, 5, 9, 12, 13, 14],
         [1, 4, 10, 11, 20, 21, 22],
         [1, 2, 3, 4, 15, 16, 17, 18] ]

# find out the longest data to plot
limit = max( [ x[-1] + 1 for x in data] )
x = np.arange(0, limit)
for shift, d in enumerate(data):
    bindata = binary_data(d, limit, 2 * shift)
    y = np.array(bindata)
    step(x, y)
show()
Edit: as @ImportanceOfBeingErnest suggested, if you prefer to perform the data-to-bindata conversion without having to define your own binary_data() function, you can use numpy.zeros_like(). Just pay more attention when you stack them:
import numpy as np
from matplotlib.pyplot import step, show

data = [ [1, 2, 4, 5, 9, 12, 13, 14],
         [1, 4, 10, 11, 20, 21, 22],
         [1, 2, 3, 4, 15, 16, 17, 18] ]

# find out the longest data to plot
limit = max( [ x[-1] + 1 for x in data] )
x = np.arange(0, limit)
for shift, d in enumerate(data):
    y = np.zeros_like(x)
    y[d] = 1
    # don't forget to shift
    y += 2*shift
    step(x, y)
show()
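As a side note (my addition, not from the answers): if vertical tick marks are acceptable instead of steps, matplotlib's eventplot() draws this kind of stacked raster directly from the event positions, with no binary conversion at all:
import matplotlib.pyplot as plt

data = [ [1, 2, 4, 5, 9, 12, 13, 14],
         [1, 4, 10, 11, 20, 21, 22],
         [1, 2, 3, 4, 15, 16, 17, 18] ]
plt.eventplot(data, lineoffsets=2)  # one row of ticks per data array, 2 apart
plt.show()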
You can create an array with all zeros and assign 1 for those elements in data
import numpy as np
data = [1,2,4,5,9]
t = np.arange(0,data[-1]+1)
x = np.zeros_like(t)
x[data] = 1
You might then plot it with the step function
import matplotlib.pyplot as plt
plt.step(t,x, where="post")
plt.show()
or with where="pre", depending on how to interpret your data.
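To see the difference between the two alignments (a small sketch of my own, reusing t and x from above):
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(2, sharex=True)
ax1.step(t, x, where="post")  # each value holds until the next sample
ax2.step(t, x, where="pre")   # each value holds since the previous sample
plt.show()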

Finding the point of a slope change as a free parameter- Python

Say I have two lists of data as follows:
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 2, 3, 4, 5, 6, 8, 10, 12, 14]
That is, it's pretty clear that merely fitting a line to this data doesn't work; instead, the slope changes at a point in the data. (Obviously one can pinpoint from this data set pretty easily where that change is, but it's not as clear in the set I'm working with, so let's ignore that.) Something with the derivative, I'm guessing, but the point here is I want to treat this as a free parameter where I say "it's this point, +/- this uncertainty, and here is the linear slope before and after this point."
Note, I can do this with an array if it's easier. Thanks!
Here is a plot of your data:
You need to find two slopes (== taking two derivatives). First, find the slope between every two points (using numpy):
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([1, 2, 3, 4, 5, 6, 8, 10, 12, 14], dtype=float)
m = np.diff(y)/np.diff(x)
print(m)
# [ 1.  1.  1.  1.  1.  2.  2.  2.  2.]
Clearly, slope changes from 1 to 2 in the sixth interval (between sixth and seventh points). Then take the derivative of this array, which tells you when the slope changes:
print(np.diff(m))
# [ 0.  0.  0.  0.  1.  0.  0.  0.]
To find the index of the non-zero value:
idx = np.nonzero(np.diff(m))[0]
print(idx)
# [4]
Since each diff shifts indices by one and indices start from zero in Python, idx + 2 = 6 tells you that the slope is different before and after the sixth point.
I'm not sure I understand exactly what you want, but you can see the evolution this way (first derivative):
>>> y = [1, 2, 3, 4, 5, 6, 8, 10, 12, 14]
>>> dy=[y[i+1]-y[i] for i in range(len(y)-1)]
>>> dy
[1, 1, 1, 1, 1, 2, 2, 2, 2]
and then find the point where it change (second derivative):
>>> dpy=[dy[i+1]-dy[i] for i in range(len(dy)-1)]
>>> dpy
[0, 0, 0, 0, 1, 0, 0, 0]
if you want the index of this point:
>>> dpy.index(1)
4
that can give you the value of the last point before the change of slope:
>>> change=dpy.index(1)
>>> y[change]
5
In your y = [1, 2, 3, 4, 5, 6, 8, 10, 12, 14] the change happens at index 4 (list indexing starts at 0) and the value of y at this point is 5.
You can calculate the slope as the difference between each pair of points (the first derivative). Then check where the slope changes (the second derivative). If it changes, append the index location to idx, the collection of points where the slope changes.
Note that the first point does not have a unique slope. The second pair of points will give you the slope, but you need the third pair before you can measure the change in slope.
idx = []
prior_slope = float(y[1] - y[0]) / (x[1] - x[0])
for n in range(2, len(x)):  # start from the 3rd pair of points
    slope = float(y[n] - y[n - 1]) / (x[n] - x[n - 1])
    if slope != prior_slope:
        idx.append(n)
    prior_slope = slope
>>> idx
[6]
Of course this could be done more efficiently in Pandas or Numpy, but I am just giving you a simple Python 2 solution.
A simple conditional list comprehension should also be pretty efficient, although it is more difficult to understand.
idx = [n for n in range(2, len(x))
       if float(y[n] - y[n - 1]) / (x[n] - x[n - 1])
       != float(y[n - 1] - y[n - 2]) / (x[n - 1] - x[n - 2])]
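For reference, a hedged numpy version of the same idea (my sketch, following the first answer above):
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([1, 2, 3, 4, 5, 6, 8, 10, 12, 14], dtype=float)
slopes = np.diff(y) / np.diff(x)          # slope of each interval
idx = np.nonzero(np.diff(slopes))[0] + 2  # +2 because each diff shifts indices by one
print(idx)  # [6], matching the loop version above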
Knee-point detection might be a potential solution:
from kneed import KneeLocator
import numpy as np
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([1, 2, 3, 4, 5, 6, 8, 10, 12, 14])
kn = KneeLocator(x, y, curve='convex', direction='increasing')
# You can use array y to automatically determine 'convex' and 'increasing' if y is well-behaved
idx = (np.abs(x - kn.knee)).argmin()
>>> print(x[idx], y[idx])
6 6

How do I implement this similarity measure in Python?

I tried implementing the distance measure shown in the image, in Python as such:
import numpy as np
A = [1, 2, 3, 4, 5, 6, 7, 8, 1]
B = [1, 2, 3, 2, 4, 6, 7, 8, 2]
A = np.asarray(A).flatten()
B = np.asarray(B).flatten()
x = np.sum(1 - np.divide((1 + np.minimum(A, B)), (1 + np.maximum(A, B))))
print("Distance: {}".format(x))
but after testing, it doesn't seem to be the right approach. The maximum value returned when there's no similarity at all between the given vectors should be 1, with 0 for perfect similarity. A and B in the image are both vectors of size m.
Edit: I forgot to add that I ignored the min(A, B) < 0 case, as that won't ever happen for my purposes.
This should work. First, we create a matrix AB by stacking the columns and calculate the minimum vector AB_min and maximum vector AB_max out of that. Then, we compute D as you defined it, making use of numpy.where to handle the two conditions. After that, we sum the elements to get the D_proposed you defined. It gives a value of 0.9 for this example.
import numpy as np

A = [1, 2, 3, 4, 5, 6, 7, 8, 1]
B = [1, 2, 3, 2, 4, 6, 7, 8, 2]

AB = np.column_stack((A, B))
AB_min = np.min(AB, 1)
AB_max = np.max(AB, 1)
print(AB_min)
print(AB_max)

# first branch where min >= 0, second branch otherwise
D = np.where(AB_min >= 0.,
             1. - (1. + AB_min) / (1. + AB_max),
             1. - (1. + AB_min + abs(AB_min)) / (1. + AB_max + abs(AB_min)))
print(D)

D_proposed = np.sum(D)
print(D_proposed)
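Since the question expects the distance to be bounded by 1 for completely dissimilar vectors, while the sum above grows with the vector length m, one hedged guess (my assumption; the formula in the image may already carry a 1/m factor) is to average the per-element distances instead:
D_normalized = np.mean(D)  # 0.1 for this example, instead of the 0.9 sum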
