Python 3:
How can I compare two matrices of similar shape to one another?
For example, let's say we have matrix x:
1 0 1
0 0 1
1 1 0
I would like to compare this to matrix y:
1 0 1
0 0 1
1 1 1
This would give me a score, for example 8/9, since 8 of the 9 entries are the same; only the last digit changed from 0 to 1. The matrices I am dealing with are much larger, but their dimensions are consistent for comparison.
There must be a library of some sort that can do this. Any thoughts?
If you are using numpy, you can simply use np.mean() on the boolean array after comparison as follows.
import numpy as np
m1 = np.array([
[1, 0, 1],
[0, 0, 1],
[1, 1, 0],
])
m2 = np.array([
[1, 0, 1],
[0, 0, 1],
[1, 1, 1],
])
score = np.mean(m1 == m2)
print(score) # prints 0.8888888888888888
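If the two inputs might not have the same shape, the `==` comparison can raise or broadcast in surprising ways, so a small guard is worth adding. A minimal sketch (the function name `similarity` is just for illustration):

```python
import numpy as np

def similarity(m1, m2):
    """Fraction of positions where m1 and m2 hold equal values."""
    m1, m2 = np.asarray(m1), np.asarray(m2)
    if m1.shape != m2.shape:
        raise ValueError(f"shape mismatch: {m1.shape} vs {m2.shape}")
    return (m1 == m2).mean()

score = similarity([[1, 0, 1], [0, 0, 1], [1, 1, 0]],
                   [[1, 0, 1], [0, 0, 1], [1, 1, 1]])
print(score)  # 0.8888888888888888
```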
You can do this easily with numpy arrays.
import numpy as np
a = np.array([
[1, 0, 1],
[0, 0, 1],
[1, 1, 0],
])
b = np.array([
[1, 0, 1],
[0, 0, 1],
[1, 1, 1],
])
print(np.sum(a == b) / a.size)
This gives back 0.8888888888888888, i.e. 8/9.
If your matrices are represented using the third-party library NumPy (which provides a lot of other useful tools for dealing with matrices, as well as any kind of rectangular, multi-dimensional array):
>>> import numpy as np
>>> x = np.array([[1,0,1],[0,0,1],[1,1,0]])
>>> y = np.array([[1,0,1],[0,0,1],[1,1,1]])
Then finding the number of corresponding equal elements is as simple as:
>>> (x == y).sum() / x.size
0.8888888888888888
This works because x == y performs an element-wise comparison of each corresponding pair of elements:
>>> x == y
array([[ True, True, True],
[ True, True, True],
[ True, True, False]])
and then we add up the boolean values (converted to integer, True has a value of 1 and False has a value of 0) and divide by the total number of elements.
If you are using NumPy you can compare them and get the following output:
import numpy as np
a = np.array([[1,0,1],[0,0,1],[1,1,0]])
b = np.array([[1,0,1],[0,0,1],[1,1,1]])
print(a == b)
Out: array([[ True,  True,  True],
            [ True,  True,  True],
            [ True,  True, False]])
To count the matches, you can flatten the boolean array into a list and count the True values:
import numpy as np
a = np.array([[1,0,1],[0,0,1],[1,1,0]])
b = np.array([[1,0,1],[0,0,1],[1,1,1]])
res = list((a == b).reshape(-1))
print(f'{res.count(True)}/{len(res)}')
Out: 8/9
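As a side note, the reshape-to-list step isn't required to produce the same `8/9` display; `np.count_nonzero` counts the True values directly. A sketch using the same arrays:

```python
import numpy as np

a = np.array([[1, 0, 1], [0, 0, 1], [1, 1, 0]])
b = np.array([[1, 0, 1], [0, 0, 1], [1, 1, 1]])

matches = np.count_nonzero(a == b)  # each True counts as 1
print(f'{matches}/{a.size}')        # 8/9
```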
Related
Is it possible to find the 0th index-position of a 2D numpy array (not) containing a given value?
What I want, and expect
I have a 2D numpy array containing integers. My goal is to find the indices of the rows that do not contain a given value (using numpy functions). Here is an example of such an array, named ortho_disc:
>>> ortho_disc
Out: [[1 1 1 0 0 0 0 0 0]
[1 0 1 1 0 0 0 0 0]
[0 0 0 0 0 0 2 2 0]]
If I wish to find the rows not containing 2, I would expect an output of [0, 1], as the first and second rows of ortho_disc do not contain the value 2.
What I have tried
I have looked into np.argwhere, np.nonzero, np.isin and np.where without expected results. My best attempt using np.where was the following:
>>> np.where(2 not in ortho_disc, [True]*3, [False]*3)
Out: [False False False]
But it does not return the expected [True, True, False]. This is especially weird given the output when ortho_disc's rows are evaluated by themselves:
>>> 2 not in ortho_disc[0]
Out: True
>>> 2 not in ortho_disc[1]
Out: True
>>> 2 not in ortho_disc[2]
Out: False
Using argwhere
Using np.argwhere, all I get is an empty array (not the expected [0, 1]):
>>> np.argwhere(2 not in ortho_disc)
Out: []
I suspect this is because numpy first flattens ortho_disc, then checks the truth-value of 2 not in ortho_disc?
The same empty array is returned using np.nonzero(2 not in ortho_disc).
My code
import numpy as np
ortho_disc = np.array([[1, 1, 1, 0, 0, 0, 0, 0, 0],
[1, 0, 1, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 2, 2, 0]])
polymer = 2
print(f'>>> ortho_disc \nOut:\n{ortho_disc}\n')
print(f'>>> {polymer} not in {ortho_disc[0]} \nOut: {polymer not in ortho_disc[0]}\n')
print(f'>>> {polymer} not in {ortho_disc[1]} \nOut: {polymer not in ortho_disc[1]}\n')
print(f'>>> {polymer} not in {ortho_disc[2]} \nOut: {polymer not in ortho_disc[2]}\n\n')
breakpoint = np.argwhere(polymer not in ortho_disc)
print(f'>>>np.argwhere({polymer} not in ortho_disc) \nOut: {breakpoint}\n\n\n')
Output:
>>> ortho_disc
Out:
[[1 1 1 0 0 0 0 0 0]
[1 0 1 1 0 0 0 0 0]
[0 0 0 0 0 0 2 2 0]]
>>> 2 not in [1 1 1 0 0 0 0 0 0]
Out: True
>>> 2 not in [1 0 1 1 0 0 0 0 0]
Out: True
>>> 2 not in [0 0 0 0 0 0 2 2 0]
Out: False
>>>np.argwhere(2 not in ortho_disc)
Out: []
Expected output
From the bottom two lines:
breakpoint = np.argwhere(polymer not in ortho_disc)
print(f'>>>np.argwhere({polymer} not in ortho_disc) \nOut: {breakpoint}\n\n\n')
I expect the following output:
>>>np.argwhere(2 not in ortho_disc)
Out: [0, 1]
Summary
I would really love feedback on how to solve this issue, as I have been scratching my head over what seems to be an easy problem for ages. And, as mentioned, it is important to avoid the obvious 'easy-way-out' loop over ortho_disc; a pure numpy solution is preferred.
Thanks in advance!
In [13]: ortho_disc
Out[13]:
array([[1, 1, 1, 0, 0, 0, 0, 0, 0],
[1, 0, 1, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 2, 2, 0]])
In [14]: polymer = 2
In [15]: (ortho_disc != polymer).all(axis=1).nonzero()[0]
Out[15]: array([0, 1])
Breaking it down: ortho_disc != polymer is an array of bools:
In [16]: ortho_disc != polymer
Out[16]:
array([[ True, True, True, True, True, True, True, True, True],
[ True, True, True, True, True, True, True, True, True],
[ True, True, True, True, True, True, False, False, True]])
We want the rows that are all True; for that, we can apply the all() method along axis 1 (i.e. along the rows):
In [17]: (ortho_disc != polymer).all(axis=1)
Out[17]: array([ True, True, False])
That's the boolean mask for the rows that do not contain polymer.
Use nonzero() to find the indices of the values that are not 0 (True is considered nonzero, False is 0):
In [19]: (ortho_disc != polymer).all(axis=1).nonzero()
Out[19]: (array([0, 1]),)
Note that nonzero() returned a tuple with length 1; in general, it returns a tuple with the same length as the number of dimensions of the array. Here the input array is 1-d. Pull out the desired result from the tuple by indexing with [0]:
In [20]: (ortho_disc != polymer).all(axis=1).nonzero()[0]
Out[20]: array([0, 1])
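The final `.nonzero()[0]` step can also be written with `np.flatnonzero`, which does the tuple-unwrapping for you. A sketch using the same arrays:

```python
import numpy as np

ortho_disc = np.array([[1, 1, 1, 0, 0, 0, 0, 0, 0],
                       [1, 0, 1, 1, 0, 0, 0, 0, 0],
                       [0, 0, 0, 0, 0, 0, 2, 2, 0]])
polymer = 2

# Indices of rows where every element differs from polymer
rows = np.flatnonzero((ortho_disc != polymer).all(axis=1))
print(rows)  # [0 1]
```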
You can use numpy's element-wise operations for this. ortho_disc != 2 will return a mask of the array where each value is True if that value in the array was not 2, False if it was 2. Then, use np.all with axis=1 to condense each row into a single boolean indicating whether that row contains only True values (i.e. no 2 anywhere in the row):
>>> np.all(ortho_disc != 2, axis=1)
array([ True, True, False])
If you want to get the indices out of that, just use np.where on the above:
>>> np.where(np.all(ortho_disc != 2, axis=1))[0]
array([0, 1])
I have many very large padded numpy 2d arrays, simplified to array A, shown below. Array Z is the basic pad array:
A = np.array(([1, 2, 3], [2, 3, 4], [0, 0, 0], [0, 0, 0], [0, 0, 0]))
Z = np.array([0, 0, 0])
How to count the number of pads in array A in the simplest / fastest pythonic way?
This works (zCount=3), but seems verbose, loopy and unpythonic:
zCount = 0
for a in A:
if a.any() == Z.any():
zCount += 1
zCount
Also tried a one-line list comprehension, which doesn't work (don't know why not):
[zCount += 1 for a in A if a.any() == Z.any()]
zCount
Also tried a list count, but 'truth value of array with more than one element is ambiguous':
list(A).count(Z)
Have searched for a simple numpy expression without success. np.count_nonzero gives full elementwise boolean for [0]. Is there a one-word / one-line counting expression for [0, 0, 0]? (My actual arrays are approx. shape (100,30) and I have up to millions of these. I am trying to deal with them in batches, so any simple time savings generating a count would be helpful). thx
Try:
>>> np.equal(A, Z).all(axis=1).sum()
3
Step by step:
>>> np.equal(A, Z)
array([[False, False, False],
[False, False, False],
[ True, True, True],
[ True, True, True],
[ True, True, True]])
>>> np.equal(A, Z).all(axis=1)
array([False, False, True, True, True])
>>> np.equal(A, Z).all(axis=1).sum()
3
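Since the question asked for a one-line counting expression, an equivalent sketch uses `np.count_nonzero` on the row mask (with `==` in place of `np.equal`, which does the same thing):

```python
import numpy as np

A = np.array([[1, 2, 3], [2, 3, 4], [0, 0, 0], [0, 0, 0], [0, 0, 0]])
Z = np.array([0, 0, 0])

# Broadcast-compare each row against Z, collapse rows with all(), count the True rows
zCount = np.count_nonzero((A == Z).all(axis=1))
print(zCount)  # 3
```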
Imagine a matrix A having one column of inequality/equality operators (≥, =, ≤) and a vector b, where the number of rows in A is equal to the number of elements in b. Then one row, in my setting, would be computed by, e.g.
dot(A[0, 1:], x) ≥ b[0]
where x is some vector, column A[,0] represents all operators, and we'd know that row 0 is supposed to be calculated using the ≥ operator (i.e. A[0,0] == "≥" is true). Now, is there a way to dynamically calculate all rows in the following, so far imaginary, way
dot(A[, 1:], x) A[, 0] b
My hope was for a dynamic evaluation of each row where we evaluate which operator is used for each row.
Example, let
A = [
[">=", -2, 1, 1],
[">=", 0, 1, 0],
["==", 0, 1, 1]
]
b = [0, 1, 1]
and x be some given vector, e.g. x = [1,1,0] we wish to compute as following
A[,1:] x A[,0] b
dot([-2, 1, 1], [1, 1, 0]) >= 0
dot([0, 1, 0], [1, 1, 0]) >= 1
dot([0, 1, 1], [1, 1, 0]) == 1
The output would be [False, True, True]
If I understand correctly, this is a way to do that operation:
import numpy as np
# Input data
a = [
[">=", -2, 1, 1],
[">=", 0, 1, 0],
["==", 0, 1, 1]
]
b = np.array([0, 1, 1])
x = np.array([1, 1, 0])
# Split in comparison and data
a0 = np.array([lst[0] for lst in a])
a1 = np.array([lst[1:] for lst in a])
# Compute dot product
c = a1 @ x
# Compute comparisons
leq = c <= b
eq = c == b
geq = c >= b
# Find comparison index for each row
cmps = np.array(["<=", "==", ">="]) # This array is lex sorted
cmp_idx = np.searchsorted(cmps, a0)
# Select the right result for each row
result = np.choose(cmp_idx, [leq, eq, geq])
# Convert to numeric type if preferred
result = result.astype(np.int32)
print(result)
# [0 1 1]
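An alternative sketch that avoids `np.searchsorted`/`np.choose` entirely: map each operator string to a comparison function from Python's `operator` module and apply it row by row. This loops in Python, so it will be slower for huge inputs, but it is arguably clearer:

```python
import operator
import numpy as np

# Map operator symbols to comparison functions
OPS = {"<=": operator.le, "==": operator.eq, ">=": operator.ge}

a = [[">=", -2, 1, 1], [">=", 0, 1, 0], ["==", 0, 1, 1]]
b = [0, 1, 1]
x = np.array([1, 1, 0])

# For each row: look up its operator, compare dot(coefficients, x) against b
result = [OPS[row[0]](np.dot(row[1:], x), rhs) for row, rhs in zip(a, b)]
print(result)  # [False, True, True]
```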
My goal is to take a list of np.arrays and create an associated list or array that classifies each as having a duplicate or not. Here's what I thought would work:
www = [np.array([1, 1, 1]), np.array([1, 1, 1]), np.array([2, 1, 1])]
uniques, counts = np.unique(www, axis = 0, return_counts = True)
counts = [1 if x > 1 else 0 for x in counts]
count_dict = dict(zip(uniques, counts))
[count_dict[i] for i in www]
The desired output for this case would be :
[1, 1, 0]
because the first and second element have another copy within the original list. It seems that the problem is that I cannot use a np.array as a key for a dictionary.
Suggestions?
First convert www to a 2D NumPy array, then do the following:
In [18]: (counts[np.where((www[:,None] == uniques).all(2))[1]] > 1).astype(int)
Out[18]: array([1, 1, 0])
Here we use broadcasting to check the equality of all www rows against the uniques array, then use all() on the last axis to find out which rows are completely equal to uniques rows.
Here's the elaborated results:
In [20]: (www[:,None] == uniques).all(2)
Out[20]:
array([[ True, False],
[ True, False],
[False, True]])
# Respective indices in `counts` array
In [21]: np.where((www[:,None] == uniques).all(2))[1]
Out[21]: array([0, 0, 1])
In [22]: counts[np.where((www[:,None] == uniques).all(2))[1]] > 1
Out[22]: array([ True, True, False])
In [23]: (counts[np.where((www[:,None] == uniques).all(2))[1]] > 1).astype(int)
Out[23]: array([1, 1, 0])
In Python, lists (and numpy arrays) cannot be hashed, so they can't be used as dictionary keys. But tuples can! So one option would be to convert your original list to a tuple, and to convert uniques to a tuple. The following works for me:
www = [np.array([1, 1, 1]), np.array([1, 1, 1]), np.array([2, 1, 1])]
www_tuples = [tuple(l) for l in www] # list of tuples
uniques, counts = np.unique(www, axis = 0, return_counts = True)
counts = [1 if x > 1 else 0 for x in counts]
# convert uniques to tuples
uniques_tuples = [tuple(l) for l in uniques]
count_dict = dict(zip(uniques_tuples, counts))
[count_dict[i] for i in www_tuples]
Just a heads-up: this will double your memory consumption, so it may not be the best solution if www is large.
You can mitigate the extra memory consumption by ingesting your data as tuples instead of numpy arrays if possible.
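One way to avoid the tuple conversion altogether is `np.unique` with `return_inverse=True`, which gives, for each input row, the index of its unique representative; indexing the counts with that inverse yields the flags directly. A sketch:

```python
import numpy as np

www = [np.array([1, 1, 1]), np.array([1, 1, 1]), np.array([2, 1, 1])]

_, inverse, counts = np.unique(www, axis=0,
                               return_inverse=True, return_counts=True)
inverse = inverse.ravel()  # flatten for compatibility across NumPy versions

# counts[inverse] gives each row the count of its unique representative
flags = (counts[inverse] > 1).astype(int)
print(flags.tolist())  # [1, 1, 0]
```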
I have a method that will predict some data and output it to a numpy array, called Y_predict. I then have a numpy array called Y_real which stores the real values of Y that should have been predicted.
For example:
Y_predict = [1, 0, 2, 1]
Y_real = [1, 0, 1, 1]
I then want an array called errRate[] which will check if Y_predict[i] == Y_real[i]. Any value that does not match Y_real should be noted. Finally, the output should be the amount of correct predictions. In the case above, this would be 0.75 since Y_predict[2] = 2 and Y_real[2] = 1
Is there some way either in numpy or python to quickly compute this rate?
Since they're numpy arrays, this is relatively straightforward:
>>> p
array([1, 0, 2, 1])
>>> r
array([1, 0, 1, 1])
>>> p == r
array([ True, True, False, True], dtype=bool)
>>> (p == r).mean()
0.75
Given these lists:
Y_predict = [1, 0, 2, 1]
Y_real = [1, 0, 1, 1]
The easiest way I can think of is using zip() within a list comp:
Y_rate = [int(x == y) for x, y in zip(Y_predict, Y_real)] # 1 if correct, 0 if incorrect
Y_rate_correct = sum(Y_rate) / len(Y_rate)
print(Y_rate_correct) # this will print 0.75