I am trying to compute similarity between two samples.
The python functions sklearn.metrics.pairwise.cosine_similarity and
scipy.spatial.distance.cosine return results that I am not satisfied with. For example:
In the following I would have expected 0.0 (0%), because the two arrays have no identical elements.
tt1 = [1, 16, 4, 21]
tt2 = [5, 17, 3, 22]
from scipy import spatial
res = 1-spatial.distance.cosine(tt1, tt2)
print(res)
0.9893593529663931
Here I would have expected a similarity of 0.25 (25%), because only a single element, the first one (1), is the same in both arrays.
tt1 = [1, 16, 4, 21]
tt2 = [1, 17, 3, 22]
from scipy import spatial
res = 1-spatial.distance.cosine(tt1, tt2)
print(res)
0.9990578001169402
In the same way, for the following I would have expected 0.5 (50%), since two elements (1 and 16) are identical:
tt1 = [1, 16, 4, 21]
tt2 = [1, 16, 3, 22]
res = 0.9989359418266097
Here I would have expected 0.75 (75%), since three elements (1, 16 and 4) are identical:
tt1 = [1, 16, 4, 21]
tt2 = [1, 16, 4, 22]
res = 0.9997474232272052
Is there a way in Python to achieve those expected results?
I think you are misunderstanding what the function computes. By your description you want to compute the misclassification error / accuracy. However, the function receives two samples u, v and computes the cosine distance between them. In your first example:
tt1 = [1, 16, 4, 21]
tt2 = [5, 17, 3, 22]
u = tt1 and v = tt2. The values in the two arrays are the coordinates of these samples in the vector space they live in (here a 4-dimensional space) - they are not different samples. Refer to the function's documentation, and specifically to the examples at the bottom.
If each coordinate in these arrays represents a different sample, then:
If order matters (consider working with numpy arrays to begin with):
np.mean(np.array(tt1) == np.array(tt2))
If order does not matter:
len(np.intersect1d(np.array(tt1), np.array(tt2))) / len(tt1)
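Putting those together with the fourth example from the question, a small runnable sketch of both options (both give 0.75 here):

import numpy as np

tt1 = [1, 16, 4, 21]
tt2 = [1, 16, 4, 22]

# order matters: fraction of positions where the values match
print(np.mean(np.array(tt1) == np.array(tt2)))   # 0.75

# order does not matter: fraction of values shared by both arrays
print(len(np.intersect1d(tt1, tt2)) / len(tt1))  # 0.75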
Those vectors are quite close together geometrically. Cosine similarity doesn't just check whether elements are identical; it measures how different the vectors are as a whole.
It looks like you just want an element-wise match rate?
sum(t1 == t2 for t1, t2 in zip(tt1, tt2)) / len(tt1)
# or, with numpy
import numpy as np
np.equal(tt1, tt2).mean()
You can use numpy.intersect1d, as explained in the documentation.
Here is an example of how I would do it with example #4:
import numpy as np
tt1 = [1, 16, 4, 21]
tt2 = [1, 16, 4, 22]
res = len(np.intersect1d(tt1, tt2)) / ((len(tt1)+len(tt2))/2)
print(res)
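For example #4 this prints 0.75, i.e. the expected 75%: the intersection is [1, 4, 16], which has 3 elements, divided by the average array length of 4.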
I am trying to solve the linear equations 3x + 6y + 7z = 10, 2x + y + 8z = 11 and x + 3y + 7z = 22 using Python and the NumPy library.
import numpy as np
a = np.array([[3, 6, 7],
[2, 1, 8],
[1, 3, 7]])
b = np.array([[10, 11, 22]])
np.linalg.solve(a, b)
but I can't figure out what I am doing wrong in the above code that causes it to throw the following error:
ValueError: solve: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (m,m),(m,n)->(m,n) (size 1 is different from 3)
Your b is a 1×3 array, so the dimensions of a and b do not match. Try
b = np.array([[10], [11], [22]]) so that b is a 3×1 array, or
b = np.array([10, 11, 22]) so that b is a vector with length 3 (which, as well as just b = [10, 11, 22], is also admissible by .solve(); see the doc).
The former will result in a 3×1 array as the solution, whereas the latter will result in a vector of length 3. Probably it is better to use the latter; usually we don't really care whether a vector is a column vector or a row vector. NumPy usually handles vectors in reasonable ways.
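For reference, a minimal runnable version using the constants from the question:

import numpy as np

a = np.array([[3, 6, 7],
              [2, 1, 8],
              [1, 3, 7]])
b = np.array([10, 11, 22])  # 1-D right-hand side

x = np.linalg.solve(a, b)
print(x)  # solution vector [x, y, z]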
I have a list with a random number of integers and/or floats. What I'm trying to achieve is to find the exceptions among my numbers (hoping I'm using the right words to explain this). For example:
list = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]
90 to 99% of my integer values are between 1 and 20
sometimes there are values that are much higher, let's say somewhere around 100 or 1,000 or even more
My problem is that these values can be different all the time. Maybe the regular range is somewhere between 1,000 and 1,200 and the exceptions are in the range of half a million.
Is there a function to filter out these special numbers?
Assuming your list is l:
If you know you want to filter a certain percentile/quantile, you can use the snippet below. It removes the bottom 10% and the top 10% (keeping only values between the 10th and 90th percentiles). Of course, you can change either cut-off to whatever you need (for example, you could drop the bottom filter and only trim the top values in your case):
import numpy as np
l = np.array(l)
l = l[(l>np.quantile(l,0.1)) & (l<np.quantile(l,0.9))].tolist()
output:
[3, 2, 14, 2, 8, 4, 3, 5]
If you are not sure of the percentile cut-off and are looking to
remove outliers:
You can adjust your cut-off for outliers via the argument m in the function call. The larger it is, the fewer outliers are removed. This function (based on the median absolute deviation) tends to be more robust to various types of outliers than other simple outlier-removal techniques.
import numpy as np

l = np.array(l)

def reject_outliers(data, m=6.):
    # distance of each value from the median
    d = np.abs(data - np.median(data))
    # median absolute deviation (fall back to 1 if it is zero)
    mdev = np.median(d)
    s = d / (mdev if mdev else 1.)
    return data[s < m].tolist()

print(reject_outliers(l))
output:
[1, 3, 2, 14, 2, 1, 8, 1, 4, 3, 5]
You can use the built-in filter() function:
lst1 = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]
lst2 = list(filter(lambda x: x > 5,lst1))
print(lst2)
Output:
[14, 108, 8, 97]
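Note that this selects the values above the threshold, i.e. the candidate outliers themselves. If you instead want to keep the regular values, invert the condition; for instance, using the question's stated regular range of 1 to 20 as the cut-off:

lst3 = list(filter(lambda x: x <= 20, lst1))
print(lst3)
# [1, 3, 2, 14, 2, 1, 8, 1, 4, 3, 5]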
Here is a method to filter out those deviators using the normal distribution's probability density function:
import math

_list = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]

def consts(_list):
    # mean of the list
    mu = 0
    for i in _list:
        mu += i
    mu = mu / len(_list)
    # (population) standard deviation of the list
    sigma = 0
    for i in _list:
        sigma += math.pow(i - mu, 2)
    sigma = math.sqrt(sigma / len(_list))
    return sigma, mu

def frequence(x, sigma, mu):
    # normal probability density at x
    return (1 / (sigma * math.sqrt(2 * math.pi))) * math.exp(-(1 / 2) * math.pow((x - mu) / sigma, 2))

sigma, mu = consts(_list)
new_list = []
for i in range(len(_list)):
    if frequence(_list[i], sigma, mu) > 0.01:
        new_list.append(_list[i])  # keep the value itself, not the index
print(new_list)
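If you prefer not to hand-roll the density, the same idea can be sketched with scipy.stats.norm (keeping the 0.01 threshold from above; the cut-off value itself is arbitrary):

import statistics
from scipy.stats import norm

_list = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]
mu = statistics.mean(_list)
sigma = statistics.pstdev(_list)  # population standard deviation, as above

filtered = [x for x in _list if norm.pdf(x, loc=mu, scale=sigma) > 0.01]
print(filtered)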
I currently have two arrays, one of which has several repeated values and another with unique values.
E.g. array 1: a = [1, 1, 2, 2, 3, 3]
E.g. array 2: b = [10, 11, 12, 13, 14, 15]
I am developing code in Python that looks at the first array, groups together the elements that are the same, and remembers their indices. New arrays are then created that contain the elements of array b at those indices.
E.g.: as array 'a' has three unique values, at positions 1,2... 3,4... 5,6, three new arrays would be created containing the elements of array b at those positions. Thus, the result would be three new arrays:
b1 = [10, 11]
b2 = [12, 13]
b3 = [14, 15]
I have managed to develop some code; however, it only works when there are exactly three unique values in array 'a'. If there are more or fewer unique values, the code has to be modified by hand.
import numpy as np
import sys

a = [1, 1, 2, 2, 3, 3]
b = [10, 11, 12, 13, 14, 15]

b_1 = []
b_2 = []
b_3 = []

# collect the unique values of a, preserving their order of appearance
unique = []
for vals in a:
    if vals not in unique:
        unique.append(vals)

if len(unique) != 3:
    sys.exit("More than 3 'a' values - check dimension")

for j in range(0, len(a)):
    if a[j] == unique[0]:
        b_1.append(b[j])
    elif a[j] == unique[1]:
        b_2.append(b[j])
    elif a[j] == unique[2]:
        b_3.append(b[j])
    else:
        sys.exit("More than 3 'a' values - check dimension")

print(b_1)
print(b_2)
print(b_3)
I was wondering if there is perhaps a more elegant way to perform this task, such that the code can cope with any number of unique values.
Well, given that you are also using numpy, here's one way using np.unique. You can set return_index=True to get the index of the first occurrence of each unique value, and use those indices to split the array b with np.split:
import numpy as np

a = np.array([1, 1, 2, 2, 3, 3])
b = np.array([10, 11, 12, 13, 14, 15])
u, s = np.unique(a, return_index=True)
np.split(b, s[1:])
Output
[array([10, 11]), array([12, 13]), array([14, 15])]
You can use the function groupby():
from itertools import groupby
from operator import itemgetter
a = [1, 1, 2, 2, 3, 3]
b = [10, 11, 12, 13, 14, 15]
[[i[1] for i in g] for _, g in groupby(zip(a, b), key=itemgetter(0))]
# [[10, 11], [12, 13], [14, 15]]
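Note that groupby() only groups consecutive equal keys, so this (like the np.unique/np.split answer) assumes equal values in a are adjacent, as in the example. If they are not, you could sort the pairs first:

pairs = sorted(zip(a, b), key=itemgetter(0))
[[v for _, v in g] for _, g in groupby(pairs, key=itemgetter(0))]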
As shown in the following code, I have a chunk list x and the full list h. I want to assign the values stored in x back into the correct positions of h.
index = 0
for t1 in range(lbp, ubp):
    h[4 + t1] = x[index]
    index = index + 1
Does anyone know how to write it in a single line/expression?
Disclaimer: This is part of a bigger project and I simplified the questions as much as possible. You can expect the matrix sizes to be correct but if you think I am missing something please ask for it. For testing you can use the following variable values:
h = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
x = [20, 21]
lbp = 2
ubp = 4
You can use slice assignment on the left-hand side to assign your x list directly into the corresponding positions of h, e.g.:
h = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
x = [20, 21]
lbp = 2
ubp = 4
h[4 + lbp:4 + ubp] = x # or better yet h[4 + lbp:4 + lbp + len(x)] = x
print(h)
# [1, 2, 3, 4, 5, 6, 20, 21, 9, 10]
I'm not really sure why you are adding 4 to the indices in your loop, nor what lbp and ubp are supposed to mean, though. Keep in mind that if the iterable you assign has a different length than the slice, the list will grow or shrink to fit it, shifting the elements that follow.
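For example, assigning three values into a two-element slice grows the list:

h = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
h[6:8] = [20, 21, 22]
print(h)
# [1, 2, 3, 4, 5, 6, 20, 21, 22, 9, 10]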
I just noticed an unexpected (at least for me) behavior in TensorFlow. I thought tf.argmax (and tf.argmin) operate on the dimensions of a Tensor from outer to inner, but apparently they do not?!
Example:
import numpy as np
import tensorflow as tf
sess = tf.InteractiveSession()
arr = np.array([[31, 23, 4, 24, 27, 34],
                [18, 3, 25, 0, 6, 35],
                [28, 14, 33, 22, 20, 8],
                [13, 30, 21, 19, 7, 9],
                [16, 1, 26, 32, 2, 29],
                [17, 12, 5, 11, 10, 15]])
# arr has rank 2 and shape (6, 6)
tf.rank(arr).eval()
> 2
tf.shape(arr).eval()
> array([6, 6], dtype=int32)
tf.argmax takes two arguments: input and dimension. Since the indices of array arr are arr[rows, columns], I would expect tf.argmax(arr, 0) to return the index of the maximum element per row, and tf.argmax(arr, 1) to return the index of the maximum element per column. Likewise for tf.argmin.
However, the opposite is true:
tf.argmax(arr, 0).eval()
> array([0, 3, 2, 4, 0, 1])
# 0 -> 31 (arr[0, 0])
# 3 -> 30 (arr[3, 1])
# 2 -> 33 (arr[2, 2])
# ...
# thus, this is clearly searching for the maximum element
# for every column, and *not* for every row
tf.argmax(arr, 1).eval()
> array([5, 5, 2, 1, 3, 0])
# 5 -> 34 (arr[0, 5])
# 5 -> 35 (arr[1, 5])
# 2 -> 33 (arr[2, 2])
# ...
# this clearly returns the maximum element per row,
# albeit 'dimension' was set to 1
Can someone explain this behavior?
More generally, every n-dimensional Tensor t is indexed by t[i, j, k, ...]. Thus, t has rank n and shape (i, j, k, ...), where dimension 0 corresponds to i, dimension 1 to j, and so forth. Why do tf.argmax (and tf.argmin) ignore this scheme?
Think of the dimension argument of tf.argmax as the axis across which you reduce. tf.argmax(arr, 0) reduces across dimension 0, i.e. the rows. Reducing across rows means that you will get the argmax of each individual column.
This might be counterintuitive, but it falls in line with the conventions used in tf.reduce_max and so on.
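NumPy uses the same axis convention, which may make the behavior easier to see (a small NumPy illustration, not TensorFlow-specific):

import numpy as np

arr = np.array([[31, 23, 4, 24, 27, 34],
                [18, 3, 25, 0, 6, 35],
                [28, 14, 33, 22, 20, 8],
                [13, 30, 21, 19, 7, 9],
                [16, 1, 26, 32, 2, 29],
                [17, 12, 5, 11, 10, 15]])

print(np.argmax(arr, axis=0))  # reduces over the rows -> argmax per column: [0 3 2 4 0 1]
print(np.argmax(arr, axis=1))  # reduces over the columns -> argmax per row:  [5 5 2 1 3 0]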
In an n-dimensional Tensor, any given dimension has n-1 other dimensions with which it forms discrete 2-dimensional subspaces. Following the same logic it has n-2 3-dimensional subspaces, all the way down to n - (n-1) n-dimensional subspaces. You could express any aggregation as a function within the remaining subspace(s), or across the subspace(s) being aggregated. Since the subspace will no longer exist after the aggregation, TensorFlow has chosen to implement it as an operation across that dimension.
Frankly, it's an implementation choice by the creators of TensorFlow; now you know.