Determine the similarity between two arrays of counts [closed] - python

The Problem: I am trying to determine the similarity between two 1D arrays composed of counts. Both the positions and relative magnitudes of the counts inside the arrays are important.
X = [1, 5, 10, 0, 0, 0, 2]
Y = [1, 2, 0, 0, 10, 0, 5]
Z = [1, 3, 8, 0, 0, 0, 1]
In this case array X is more similar to array Z than array Y.
I have tried a few metrics, including cosine distance, earth mover's distance (EMD), and histogram intersection. While cosine distance and EMD both work decently, only EMD really satisfies both of my conditions.
I am curious to know if there are other algorithms / distance metrics out there that exist to answer this sort of problem.
Thank you!
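For context, here is a quick sketch of two of the metrics mentioned above, using SciPy (for the earth mover's distance, the counts are passed as weights over their index positions; the values in the comments are approximate):

from scipy.spatial import distance
from scipy.stats import wasserstein_distance

X = [1, 5, 10, 0, 0, 0, 2]
Y = [1, 2, 0, 0, 10, 0, 5]
Z = [1, 3, 8, 0, 0, 0, 1]

positions = range(len(X))
# Cosine distance: ~0.84 for (X, Y) versus ~0.01 for (X, Z).
print(distance.cosine(X, Y), distance.cosine(X, Z))
# 1D earth mover's distance over positions, weighted by the counts.
print(wasserstein_distance(positions, positions, X, Y),
      wasserstein_distance(positions, positions, X, Z))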

One popular and simple method is root-mean-square: sum the squares of the element-wise differences, take the square root, and divide by the number of elements. (Strictly speaking, RMS takes the mean inside the root; dividing the root by the length, as below, only changes the scale for a fixed length, not the ranking.) In your case, X vs Y produces about 2.1, and X vs Z about 0.4.
import math

X = [1, 5, 10, 0, 0, 0, 2]
Y = [1, 2, 0, 0, 10, 0, 5]
Z = [1, 3, 8, 0, 0, 0, 1]

def rms(a, b):
    # Root of the summed squared differences, scaled by the length.
    return math.sqrt(sum((a1 - b1) * (a1 - b1) for a1, b1 in zip(a, b))) / len(a)

print(rms(X, Y))  # ~2.11
print(rms(X, Z))  # ~0.43

Perhaps the Manhattan distance works for you. The Manhattan distance between X and Y is 26, between X and Z it is 5, and between Y and Z it is 23.
def manhattan(x, y):
    # Sum of absolute element-wise differences.
    return sum(abs(val1 - val2) for val1, val2 in zip(x, y))

X = [1, 5, 10, 0, 0, 0, 2]
Y = [1, 2, 0, 0, 10, 0, 5]
Z = [1, 3, 8, 0, 0, 0, 1]

manhattan(X, Y)  # returns 26
manhattan(X, Z)  # returns 5
manhattan(Y, Z)  # returns 23

# A few more options: cross-correlation, phase correlation, dynamic time
# warping, and Pearson correlation.
from dtaidistance import dtw
import numpy as np

X = [1, 5, 10, 0, 0, 0, 2]
Y = [1, 2, 0, 0, 10, 0, 5]
Z = [1, 3, 8, 0, 0, 0, 1]

def phase_corr(sig1, sig2):
    # Normalized cross-power spectrum; the inverse FFT peaks at the
    # shift that best aligns the two signals.
    fft_sig1 = np.fft.fft(sig1)
    fft_sig2_conj = np.conj(np.fft.fft(sig2))
    R = (fft_sig1 * fft_sig2_conj) / abs(fft_sig1 * fft_sig2_conj)
    return np.real(np.fft.ifft(R))

print(np.correlate(X, Z), np.correlate(Y, Z))        # cross-correlation
print(max(phase_corr(X, Z)), max(phase_corr(Y, Z)))  # phase correlation
print(dtw.distance(X, Z), dtw.distance(Y, Z))        # DTW: smaller distance means more similar
print(np.corrcoef(X, Z)[1, 0], np.corrcoef(Y, Z)[1, 0])  # Pearson correlation

Check out scipy.spatial.distance for various distance metrics.
For instance, with the Chebyshev distance, we get that X is more similar to Z than to Y.
from scipy.spatial import distance
X = [1, 5, 10, 0, 0, 0, 2]
Y = [1, 2, 0, 0, 10, 0, 5]
Z = [1, 3, 8, 0, 0, 0, 1]
print(distance.chebyshev(X, Y)) # returns 10
print(distance.chebyshev(X, Z)) # returns 2
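If you want to survey more candidates quickly, a small sweep like the one below (a sketch; each name is a metric scipy.spatial.distance actually provides) compares d(X, Y) against d(X, Z) in one pass:

from scipy.spatial import distance

X = [1, 5, 10, 0, 0, 0, 2]
Y = [1, 2, 0, 0, 10, 0, 5]
Z = [1, 3, 8, 0, 0, 0, 1]

for name in ['braycurtis', 'canberra', 'chebyshev', 'cityblock',
             'cosine', 'euclidean']:
    metric = getattr(distance, name)
    print(f'{name}: d(X,Y)={metric(X, Y):.3f}, d(X,Z)={metric(X, Z):.3f}')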

Related

How to find a distance between elements in numpy array?

For example, I have this array z:
array([1, 0, 1, 0, 0, 0, 1, 0, 0, 1])
How can I find the distances between two successive 1s in this array, measured in the number of 0s between them?
For example, in the z array, these distances are:
[1, 3, 2]
I have this code for it:
distances = []
prev_idx = 0
for idx, element in enumerate(z):
    if element == 1:
        distances.append(idx - prev_idx)
        prev_idx = idx
distances = np.array(distances[1:]) - 1
Can this operation be done without the for-loop, and maybe in a more efficient way?
UPD
The solution in the answer by @warped works fine in the 1D case.
But what if z is a 2D array, like np.array([z, z])?
You can use np.where to find the ones, and then np.diff to get the distances:
q=np.where(z==1)
np.diff(q[0])-1
out:
array([1, 3, 2], dtype=int64)
Edit, for 2D arrays:
You can use the minimum of the Manhattan distance (decremented by 1) between the positions that contain ones to get the number of zeros in between:
def manhattan_distance(a, b):
    return np.abs(np.array(a) - np.array(b)).sum()

zeros_between = []
r, c = np.where(z == 1)
coords = list(zip(r, c))
for i, c in enumerate(coords[:-1]):
    zeros_between.append(
        np.min([manhattan_distance(c, coords[j]) - 1 for j in range(i + 1, len(coords))]))
If you don't want to use the for-loop, you can use np.where and np.roll:
import numpy as np
x = np.array([1, 0, 1, 0, 0, 0, 1, 0, 0, 1])
pos = np.where(x==1)[0] #pos = array([0, 2, 6, 9])
shift = np.roll(pos,-1) # shift = array([2, 6, 9, 0])
result = ((shift-pos)-1)[:-1]
# shift-pos = array([ 2, 4, 3, -9])
# (shift-pos)-1 = array([ 1, 3, 2, -10])
# ((shift-pos)-1)[:-1] = array([ 1, 3, 2])
print(result)

Calculate cluster accuracy of two clustering outcomes

So say I have two clustering outcomes that look like this:
clustering = [[8, 9, 10, 11], [14, 13, 4, 7, 6, 12, 5, 15], [1, 2, 0, 3]]
correct_clustering = [[2, 8, 10, 0, 15], [12, 13, 9, 14], [11, 3, 5, 1, 4, 6, 7]]
How would I go about comparing the outcome contained in clustering to the one contained in correct_clustering? I want some number between 0 and 1. I was thinking about calculating the fraction of pairs which are correctly clustered together in the same cluster, but I can't think of a programmatic way to solve this.
The best-practice measures are indeed based on pair counting.
In particular, the adjusted Rand index (ARI) is the standard measure here.
You don't actually count pairs; the number of pairs from a set of n items can trivially be computed with the binomial coefficient, simply (n*(n-1))>>1.
You'll need this for each cluster and each cluster intersection.
The results of all intersections are aggregated, and it is easy to see that this is invariant to the permutation of clusters (and hence to the cluster labels). The Rand index is the accuracy of predicting whether two objects a, b are in the same cluster or in different clusters. The ARI improves on this by adjusting for chance: in a very unbalanced problem, a random result can score a high accuracy, but its ARI stays close to 0 on average.
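As a quick sanity check on that formula (a tiny illustrative snippet):

from math import comb

n = 8
pairs = (n * (n - 1)) >> 1   # C(n, 2): the number of unordered pairs
assert pairs == comb(n, 2) == 28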
Use the Rand Index:
import numpy as np
from scipy.special import comb
def rand_index_score(clusters, classes):
    tp_plus_fp = comb(np.bincount(clusters), 2).sum()
    tp_plus_fn = comb(np.bincount(classes), 2).sum()
    A = np.c_[(clusters, classes)]
    tp = sum(comb(np.bincount(A[A[:, 0] == i, 1]), 2).sum()
             for i in set(clusters))
    fp = tp_plus_fp - tp
    fn = tp_plus_fn - tp
    tn = comb(len(A), 2) - tp - fp - fn
    return (tp + tn) / (tp + fp + fn + tn)
clusters = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
classes = [0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 2, 1, 0, 2, 2, 2, 0]
rand_index_score(clusters, classes)
0.6764705882352942
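For reference, recent scikit-learn versions (0.24 and later) ship the plain Rand index as well; it should agree with the function above:

from sklearn.metrics import rand_score
print(rand_score(clusters, classes))  # expected to match the result above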
You can use the function adjusted_rand_score in sklearn:
from sklearn.metrics import adjusted_rand_score
clustering = sorted((i, num) for num, lst in enumerate(clustering) for i in lst)
clustering = [i for _, i in clustering]
# [2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
correct_clustering = sorted((i, num) for num, lst in enumerate(correct_clustering) for i in lst)
correct_clustering = [i for _, i in correct_clustering]
# [0, 2, 0, 2, 2, 2, 2, 2, 0, 1, 0, 2, 1, 1, 1, 0]
ari = adjusted_rand_score(correct_clustering, clustering)
# -0.012738853503184737
The function returns values between -1 and 1, so to get a value between 0 and 1 you need to rescale:
ari_scaled = (ari + 1) / 2
# 0.49363057324840764

How to vectorize this operation

Say I have two lists (always the same length):
l0 = [0, 4, 4, 4, 0, 0, 0, 8, 8, 0]
l1 = [0, 1, 1, 1, 0, 0, 0, 8, 8, 8]
I have the following rules for intersections and unions I need to apply when comparing these lists element-wise:
# union and intersect
uni = [0] * len(l0)
intersec = [0] * len(l0)
for i in range(len(l0)):
    if l0[i] == l1[i]:
        uni[i] = l0[i]
        intersec[i] = l0[i]
    else:
        intersec[i] = 0
        if l0[i] == 0:
            uni[i] = l1[i]
        elif l1[i] == 0:
            uni[i] = l0[i]
        else:
            uni[i] = [l0[i], l1[i]]
Thus, the desired output is:
uni: [0, [4, 1], [4, 1], [4, 1], 0, 0, 0, 8, 8, 8]
intersec: [0, 0, 0, 0, 0, 0, 0, 8, 8, 0]
While this works, I need to do this with several hundred very large lists (each with thousands of elements), so I am looking for a way to vectorize this. I tried using np.where and various masking strategies, but that went nowhere fast. Any suggestions would be most welcome.
* EDIT *
Regarding
uni: [0, [4, 1], [4, 1], [4, 1], 0, 0, 0, 8, 8, 8]
versus
uni: [0, [4, 1], [4, 1], [4, 1], 0, 0, 0, 8, 8, [0, 8]]
I'm still fighting the 8 versus [0, 8] in my mind. The lists are derived from BIO tags in system annotations (see IOB labeling of text chunks), where each list element is a character index in a document and the value is an assigned enumerated label. 0 represents no annotation (i.e., it is used for determining negatives in a confusion matrix), while non-zero elements represent assigned enumerated labels for that character. Since I am ignoring true negatives, I think I can say 8 is equivalent to [0, 8]. As to whether this simplifies things, I am not yet sure.
* EDIT 2 *
I'm using [0, 8] to keep things simple and to keep the definitions of intersection and union consistent with set theory.
I would stay away from calling them 'intersection' and 'union', since those operations have well-defined meanings on sets and the operation you're looking to perform is neither of them.
However, to do what you want:
l0 = [0, 4, 4, 4, 0, 0, 0, 8, 8, 0]
l1 = [0, 1, 1, 1, 0, 0, 0, 8, 8, 8]
values = [
    (x if x == y else 0,             # intersection
     x if x == y                     # union: equal values collapse to one
     else x if y == 0
     else y if x == 0
     else [x, y])
    for x, y in zip(l0, l1)
]
result_a, result_b = map(list, zip(*values))
print(result_a)
print(result_b)
This is more than enough for thousands, or even millions of elements since the operation is so basic. Of course, if we're talking billions, you may want to look at numpy anyway.
A semi-vectorized solution for the union, and a fully vectorized one for the intersection:
import numpy as np

l0 = np.array(l0)
l1 = np.array(l1)

# intersection: keep the value where both arrays agree, else 0
intersec = np.zeros(l0.shape[0])
intersec_idx = np.where(l0 == l1)
intersec[intersec_idx] = l0[intersec_idx]
intersec = intersec.astype(int).tolist()

# union: same for matching positions; fall back to a loop for mismatches
union = np.zeros(l0.shape[0])
union_idx = np.where(l0 == l1)
union[union_idx] = l0[union_idx]
no_union_idx = np.where(l0 != l1)
union = union.astype(int).tolist()
for idx in no_union_idx[0]:
    union[idx] = [l0[idx], l1[idx]]
and the output:
>>> intersection
[0, 0, 0, 0, 0, 0, 0, 8, 8, 0]
>>> union
[0, [4, 1], [4, 1], [4, 1], 0, 0, 0, 8, 8, [0, 8]]
NB: I think your original union solution is incorrect; see the last output, 8 vs [0, 8].

Backtracking is failing in SUDOKU

I have been trying to implement Sudoku in Python, but the backtracking is not working at all. When I input a 4x4 grid of 0's I get output, but most of the time it fails on the standard 9x9 grid (with 3x3 boxes). This test case progresses correctly until it reaches the last element of the second row.
import math

solution = [[3,0,6,5,0,8,4,0,0],
            [5,2,0,0,0,0,0,0,0],
            [0,8,7,0,0,0,0,3,1],
            [0,0,3,0,1,0,0,8,0],
            [9,0,0,8,6,3,0,0,5],
            [0,5,0,0,9,0,6,0,0],
            [1,3,0,0,0,0,2,5,0],
            [0,0,0,0,0,0,0,7,4],
            [0,0,5,2,0,6,3,0,0]]
# solution = [[0 for x in range(4)] for y in range(4)]
N = 9
row = 0
col = 0

def positionFound():
    global row, col
    for x in range(N):
        for y in range(N):
            if solution[x][y] == 0:
                row, col = x, y
                return row, col
    return False

def isSafe(row, col, num):
    global N
    for c in range(N):
        if solution[row][c] == num:
            return False
    for r in range(N):
        if solution[r][col] == num:
            return False
    r = row - row % int(math.sqrt(N))
    c = col - col % int(math.sqrt(N))
    for x in range(r, r + int(math.sqrt(N))):
        for y in range(c, c + int(math.sqrt(N))):
            if solution[x][y] == num:
                return False
    return True

def sudoku(solution):
    global row, col
    if positionFound() is False:
        print('SUCCESS')
        for x in solution:
            print(x)
        return True
    for number in range(1, N + 1):
        if isSafe(row, col, number):
            solution[row][col] = number
            if sudoku(solution) is True:
                return True
            solution[row][col] = 0
    return False

sudoku(solution)
for x in solution:
    print(x)
OUTPUT:
[3, 1, 6, 5, 2, 8, 4, 9, 7]
[5, 2, 4, 1, 3, 7, 8, 6, 0]
[0, 8, 7, 0, 0, 0, 0, 3, 1]
[0, 0, 3, 0, 1, 0, 0, 8, 0]
[9, 0, 0, 8, 6, 3, 0, 0, 5]
[0, 5, 0, 0, 9, 0, 6, 0, 0]
[1, 3, 0, 0, 0, 0, 2, 5, 0]
[0, 0, 0, 0, 0, 0, 0, 7, 4]
[0, 0, 5, 2, 0, 6, 3, 0, 0]
The reason your backtracking isn't working is that you haven't actually implemented backtracking. Once you fail to place a number at a given location, you have no provision to return your [row, col] cursor to the previous position. You need a way to know what the previous filled position was, so you can resume with the next legal number for that position. Your recursion holds previous board positions in the stack, but you've lost the cursor position, and your re-try loop assumes that it gets reset.
One strong possibility is to make row and col local variables, keeping them coordinated with the solution grid they describe. Make them part of the parameter passing, so the stack maintains those values for you.
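As a minimal sketch of that fix (assuming the isSafe and N definitions above, and calling it as solve(solution)), the empty cell can be found inside each call and kept local, so every stack frame resets exactly the cell it filled:

def solve(grid):
    # Find the next empty cell; None means the grid is complete.
    cell = next(((r, c) for r in range(N) for c in range(N)
                 if grid[r][c] == 0), None)
    if cell is None:
        return True
    row, col = cell  # local cursor: unaffected by deeper recursive calls
    for number in range(1, N + 1):
        if isSafe(row, col, number):
            grid[row][col] = number
            if solve(grid):
                return True
            grid[row][col] = 0  # undo at this frame's own cell
    return False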

Interpreting (and comparing) output from numpy.correlate

I have looked at this question but it hasn't really given me any answers.
Essentially, how can I determine whether a strong correlation exists or not using np.correlate? I expect the same output as I get from MATLAB's xcorr with the coeff option, which I can understand (1 is a strong correlation at lag l and 0 is no correlation at lag l), but np.correlate produces values greater than 1, even when the input vectors have been normalised between 0 and 1.
Example input
import numpy as np
x = np.random.rand(10)
y = np.random.rand(10)
np.correlate(x, y, 'full')
This gives the following output:
array([ 0.15711279, 0.24562736, 0.48078652, 0.69477838, 1.07376669,
1.28020871, 1.39717118, 1.78545567, 1.85084435, 1.89776181,
1.92940874, 2.05102884, 1.35671247, 1.54329503, 0.8892999 ,
0.67574802, 0.90464743, 0.20475408, 0.33001517])
How can I tell what is a strong correlation and what is weak if I don't know the maximum possible correlation value is?
Another example:
In [10]: x = [0,1,2,1,0,0]
In [11]: y = [0,0,1,2,1,0]
In [12]: np.correlate(x, y, 'full')
Out[12]: array([0, 0, 1, 4, 6, 4, 1, 0, 0, 0, 0])
Edit: This was a badly asked question, but the marked answer does answer what was asked. I think it is important to note what I found while digging around in this area: you cannot compare raw outputs from cross-correlation. In other words, it would not be valid to use the outputs from cross-correlation to say signal x is better correlated to signal y than to signal z. Cross-correlation does not provide this kind of information.
numpy.correlate is under-documented. I think that we can make sense of it, though. Let's start with your sample case:
>>> import numpy as np
>>> x = [0,1,2,1,0,0]
>>> y = [0,0,1,2,1,0]
>>> np.correlate(x, y, 'full')
array([0, 0, 1, 4, 6, 4, 1, 0, 0, 0, 0])
Those numbers are the cross-correlations for each of the possible lags. To make that more clear, let's put the lag numbers above the correlations:
>>> np.concatenate((np.arange(-5, 6)[None,...], np.correlate(x, y, 'full')[None,...]), axis=0)
array([[-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5],
[ 0, 0, 1, 4, 6, 4, 1, 0, 0, 0, 0]])
Here, we can see that the cross-correlation reaches its peak at a lag of -1. If you look at x and y above, that makes sense: if one shifts y to the left by one place, it matches x exactly.
To verify this, let's try again, this time shifting y further:
>>> y = [0, 0, 0, 0, 1, 2]
>>> np.concatenate((np.arange(-5, 6)[None,...], np.correlate(x, y, 'full')[None,...]), axis=0)
array([[-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5],
[ 0, 2, 5, 4, 1, 0, 0, 0, 0, 0, 0]])
Now, the correlation peaks at a lag of -3, meaning that the best match between x and y occurs when y is shifted to the left by 3 places.
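To get MATLAB-style coefficients in [-1, 1], one option is to normalize the full cross-correlation by the zero-lag energies of the two signals, which is what xcorr(..., 'coeff') does. A sketch (the function name xcorr_coeff is made up for illustration):

import numpy as np

def xcorr_coeff(a, b):
    # Divide by the sqrt of the zero-lag autocorrelation energies, so an
    # exact shifted copy peaks at exactly 1.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
    return np.correlate(a, b, 'full') / denom

x = [0, 1, 2, 1, 0, 0]
y = [0, 0, 1, 2, 1, 0]
print(xcorr_coeff(x, y).max())  # 1.0, since y is an exact shifted copy of x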
