I have two equally sized numpy arrays (they happen to be 48x365) where every element is either -1, 0, or 1. I want to compare the two and count how many times they are the same and how many times they differ, discounting all the positions where at least one of the arrays has a zero, since that means no data. For instance:
for x in range(48):
    for y in range(365):
        if array1[x][y] != 0:
            if array2[x][y] != 0:
                if array1[x][y] == array2[x][y]:
                    score = score + 1
                else:
                    score = score - 1
return score
This takes a very long time. I was thinking of taking advantage of the fact that multiplying the elements together and summing all the answers may give the same outcome, and I'm looking for a special numpy function to help with that. I'm not really sure what unusual numpy functions are out there.
Simply do not iterate. Iterating over a numpy array defeats the purpose of using the tool.
ans = np.logical_and(
    np.logical_and(array1 != 0, array2 != 0),
    array1 == array2)
should give the correct solution.
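To turn that mask into the score from the question (matches minus mismatches), here is a minimal sketch on made-up stand-in data (the shape and RNG seed are arbitrary); it also confirms the multiply-and-sum idea from the question:

import numpy as np

rng = np.random.default_rng(0)
array1 = rng.integers(-1, 2, size=(48, 365))  # stand-ins for the real data
array2 = rng.integers(-1, 2, size=(48, 365))

valid = (array1 != 0) & (array2 != 0)         # positions where both arrays have data
matches = ((array1 == array2) & valid).sum()  # agreements on valid positions
score = 2 * matches - valid.sum()             # matches minus mismatches

# multiplying and summing gives the same number, since
# (-1)*(-1) = 1*1 = 1, (-1)*1 = -1, and any zero contributes 0
assert score == (array1 * array2).sum()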
For me the easiest way is to do this:
A = numpy.array(...)  # first array
B = numpy.array(...)  # second array
T = A - B
max_diff = numpy.max(numpy.abs(T))
epsilon = 1e-6
if max_diff > epsilon:
    raise Exception("Not matching arrays")
This quickly tells you whether the arrays are the same, and it works for comparing float values too!
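For what it's worth, numpy has a built-in helper for exactly this kind of absolute-tolerance comparison; assuming A and B as above, a one-line equivalent (rtol=0 disables the relative-tolerance term):

import numpy as np

if not np.allclose(A, B, rtol=0, atol=1e-6):
    raise Exception("Not matching arrays")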
Simple calculations along the following lines will help you select the most suitable way to handle your case:
In []: A, B = randint(-1, 2, size=(48, 365)), randint(-1, 2, size=(48, 365))
In []: ignore = (0 == A) | (0 == B)
In []: valid = ~ignore
In []: (A[valid] == B[valid]).sum()
Out[]: 3841
In []: (A[valid] != B[valid]).sum()
Out[]: 3849
In []: ignore.sum()
Out[]: 9830
Ensuring that the calculations are valid:
In []: 3841 + 3849 + 9830 == 48*365
Out[]: True
Therefore your score (with these random values) would be:
In []: a, b = A[valid], B[valid]
In []: score = (a == b).sum() - (a != b).sum()
In []: score
Out[]: -8
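Since the elements are only -1, 0, and 1, the same score also falls out of a single product-and-sum: the element-wise product is 1 for agreements, -1 for disagreements, and 0 wherever either array holds a zero:

In []: (A * B).sum()
Out[]: -8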
import numpy as np
A = np.array(...)
B = np.array(...)
...
Z = np.array(...)
to_test = np.array([A, B, ..., Z])
# compare linewise whether all arrays equal the first one;
# the list() matters in Python 3, where map() is a lazy iterator
np.all(list(map(lambda x: np.all(x == to_test[0, :]), to_test[1:, :])))
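A broadcasting one-liner does the same job without the lambda; a small self-contained sketch with made-up arrays:

import numpy as np

to_test = np.array([[1, 2, 3], [1, 2, 3], [1, 2, 3]])
# broadcasting compares every stacked array against the first one
print(np.all(to_test == to_test[0]))  # True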
I often come across an idiom like the following: say I have data like
N = 20 # or some other number
a = np.random.randint(0, 10, N) # or any other 1D np.array
predicate = lambda x: x%2 == 0 # or any other predicate
The idiom I encounter is along the lines of:
b = np.full_like(a, -1)
i1 = 0
for i, x in enumerate(a):
    if predicate(x):
        b[i] = i1
        i1 += 1
How do I translate this to numpy? The following:
b = np.full_like(a, -1)
m = some_predicate(a)
b[m] = np.arange(np.count_nonzero(m))
looks a bit odd to me: this is three lines for such a simple task. In particular, it disturbs me that I need to store m, which I do since I need to reference it twice (because I have no way to say "arange with as many values as necessary").
Walrus operator to the rescue (starting with Python 3.8):
i = -1
b = np.array([-1 if not predicate(val) else (i := i+1) for val in a])
or (presumably significantly faster for large arrays)
b = np.full_like(a, -1)
b[sel] = np.arange(np.count_nonzero(sel := predicate(a)))
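For concreteness, a small worked example of the three-line masked-arange idiom from the question (the input values are made up):

import numpy as np

a = np.array([3, 4, 7, 8, 10, 5])
predicate = lambda x: x % 2 == 0

b = np.full_like(a, -1)
m = predicate(a)                        # boolean mask of matching elements
b[m] = np.arange(np.count_nonzero(m))   # running index over the matches
print(b)  # [-1  0 -1  1  2 -1]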
I have two arrays like these:
a = [[1,2,-3],[4,5,-6],[7,8,9]]
b = [[2,-5,0],[0,4,8],[-2,1,0]]
Every number of "a" should be replaced with the one from "b", except of those, where the number of "b" is 0:
result = [[2,-5,-3],[4,4,8],[-2,1,9]]
My current solution takes way too long:
for row in range(len(b)):
    for column in range(len(b[row])):
        if b[row][column] != 0 or b[row][column] != -0:
            a[row][column] = b[row][column]
Btw, is the "b[row][column] != -0" check necessary? There are sometimes "0"s and sometimes "-0"s in b.
Is there a fast way?
Thanks.
Just use np.where()
a = np.array(a)
b = np.array(b)
a = np.where(b == 0, a, b)
If you want to get fancy and save memory, use np.place()
np.place(a, b != 0, b[b != 0])
EDIT: Since 0 == -0 evaluates to True, you don't need any other checks.
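Checking the np.where version against the arrays from the question reproduces the expected result:

import numpy as np

a = np.array([[1, 2, -3], [4, 5, -6], [7, 8, 9]])
b = np.array([[2, -5, 0], [0, 4, 8], [-2, 1, 0]])

print(np.where(b == 0, a, b))
# [[ 2 -5 -3]
#  [ 4  4  8]
#  [-2  1  9]]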
One possibility:
a[np.where(b != 0)] = b[np.where(b != 0)]
What I am trying to do is take an array, transpose it, subtract the two arrays, and then see if the difference of each cell is within a certain tolerance. I am able to get the subtracted array, but I don't know how to cycle through each item to compare the amounts. Ideally I would test for floating-point near-equality and return True if all items are within the tolerance and False otherwise; I'm not sure how to do this last step either.
import numpy as np
a = np.array([[1, 2, 3], [2, 3, 8], [3, 4, 1]])
b = a.transpose(1, 0)
rows = a.shape[1]
col = a.shape[0]
r = abs(np.subtract(a, b))  # absolute value of the element-wise difference
i = 0
while i < rows:
    j = 0
    while j < rows:
        if np.any(r[i][j] > 3):  # sample using 3 as tolerance
            print("false")
        j += 1
    print("true")
    i += 1
Is this not sufficient for your needs?
tolerance = 3
result = (abs(a - b) <= tolerance).all()
In this step
r = abs(np.subtract(a, b))
you already have a matrix of distances, so all you need to do is apply a comparison operator (which in numpy is applied element-wise)
errors = r > 3
which results in a boolean array, and if you want to see how many elements have a True value, just sum it
print(np.sum(r > 3))
and to check if any is wrong, you can just do
print(np.sum(r > 3) > 0)  # prints True iff any element of r is bigger than 3
There are also built-in methods, but this reasoning gives you more flexibility in expressing what is "near" or "good".
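For reference, those built-ins are np.isclose and np.allclose; a minimal sketch on the arrays from the question (setting rtol=0 turns them into the same pure absolute-tolerance test as above):

import numpy as np

a = np.array([[1, 2, 3], [2, 3, 8], [3, 4, 1]])
b = a.T

print(np.isclose(a, b, rtol=0, atol=3).all())  # element-wise test, then reduce; False here
print(np.allclose(a, b, rtol=0, atol=3))       # equivalent single call; one pair differs by 4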
I have a 3x3 array with numbers and zeroes. I need to take the absolute difference between the next point, ls[i+1], and the point before it, ls[i]. Here is an example of my list:
ls=[(98.6,99,0),(98.2,98.4,97.1),(97.6,0,98.3)]
The zeroes are faulty data. I need a loop that will:
1. Take the absolute difference between the future number and the current number in each row,
2. Make the differences greater than the max difference zero (max diff=1.9 in this case, given that the zeroes are faulty data),
3. Sum together the differences in each row so that I'm left with a list of the sums.
As it stands now, the end result will be:
result=[(0.4,99),(0.2,1.3),(97.6,98.3)]
Given that the zeroes are not good data, differences greater than 1.9 are not an accurate result.
If you're happy with setting differences over a given maximum difference value to 0, perhaps implement that logic in a 2nd step:
ls = [(98.6, 99, 0), (98.2, 98.4, 97.1), (97.6, 0, 98.3)]
unfiltered = [tuple(abs(x1 - x2) for x1, x2 in zip(tup, tup[1:]))
              for tup in ls]
max_diff = 1.9
results = [tuple((x if x < max_diff else 0) for x in tup)
           for tup in unfiltered]
If you have objects that are not native python lists/tuples but do support indexing, it might be better to do this:
ls = [(98.6, 99, 0), (98.2, 98.4, 97.1), (97.6, 0, 98.3)]
unfiltered = [tuple(abs(item[i] - item[i+1]) for i in range(len(item)-1))
              for item in ls]
max_diff = 1.9
results = [tuple((x if x < max_diff else 0) for x in tup)
           for tup in unfiltered]
The numbers get a little messed up when taking the absolute difference because the values cannot be represented exactly in binary floating point; the differences below are correct to within rounding error.
ls = [(98.6, 99, 0), (98.2, 98.4, 97.1), (97.6, 0, 98.3)]

def abs_diff(lst, max_diff=1.9):
    n = len(lst)
    if n < 2:
        return lst
    res = []
    for i in range(n-1):
        diff = abs(lst[i] - lst[i+1])
        if diff > max_diff:
            res.append(0)
        else:
            res.append(diff)
    return res

result = map(tuple, map(abs_diff, ls))
print result
# [(0.40000000000000568, 0), (0.20000000000000284, 1.3000000000000114), (0, 0)]
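If numpy is an option, the whole computation vectorizes; a sketch assuming the 1.9 cutoff from the question:

import numpy as np

a = np.array([(98.6, 99, 0), (98.2, 98.4, 97.1), (97.6, 0, 98.3)])
d = np.abs(np.diff(a, axis=1))  # row-wise absolute differences
d[d > 1.9] = 0                  # zero out differences beyond the cutoff
print(d)
# [[0.4 0. ]
#  [0.2 1.3]
#  [0.  0. ]]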
This should do you. I've broken out the awkward subtraction/clearing of bad values into its own function, and then you can move through the list tail-recursively, building the needed values as you go and filtering out 0s.
def awkward_subtract(a, b):
    if (a is None) or (b is None) or (a == 0) or (b == 0):
        return 0
    else:
        return abs(a - b)

def compare_lists(ls):
    head, *tail = ls
    if not tail:
        return [list(filter(int(0).__ne__, head))]
    else:
        values = [awkward_subtract(head[x], tail[0][x]) for x in range(0, len(head))]
        return [list(filter(int(0).__ne__, values))] + compare_lists(tail)
You can test it in the REPL*:
>>> ls = [[98.6,99,0],[98.2,98.4,97.1],[97.6,0,98.3]]
>>> compare_lists(ls)
[[0.3999999999999915, 0.5999999999999943], [0.6000000000000085, 1.2000000000000028], [97.6, 98.3]]
(*) I think your test is not quite right, btw.
Note that this uses embedded lists for ease, but it is dead simple to fix that:
ts = [(98.6,99,0),(98.2,98.4,97.1),(97.6,0,98.3)]
ls = [list(t) for t in ts]
I have random 2d arrays which I make using
import numpy as np
from itertools import combinations
n = 50
A = np.random.randint(2, size=(n,n))
I would like to determine if the matrix has two pairs of rows which sum to the same row vector. I am looking for a fast method to do this. My current method just tries all possibilities.
for pair in combinations(combinations(range(n), 2), 2):
    if np.array_equal(A[pair[0][0]] + A[pair[0][1]], A[pair[1][0]] + A[pair[1][1]]):
        print "Pair found", pair
A method that worked for n = 100 would be really great.
Here is a pure numpy solution; no extensive timings, but I have to push n up to 500 before I can see my cursor blink once before it completes. It is memory intensive though, and will fail due to memory requirements for much larger n. Either way, my intuition is that the odds of finding such a vector decrease geometrically for larger n anyway.
import numpy as np

n = 100
A = np.random.randint(2, size=(n, n)).astype(np.int8)

def base3(a):
    """
    pack the last axis of an array in base 3
    40 base 3 numbers per uint64
    """
    S = np.array_split(a, a.shape[-1]//40+1, axis=-1)
    R = np.zeros(shape=a.shape[:-1]+(len(S),), dtype=np.uint64)
    for i in xrange(len(S)):
        s = S[i]
        r = R[..., i]
        for j in xrange(s.shape[-1]):
            r *= 3
            r += s[..., j]
    return R

def unique_count(a):
    """returns counts of unique elements"""
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), np.int)
    np.add.at(count, inverse, 1)
    return unique, count

def voidview(arr):
    """view the last axis of an array as a void object; can be used as a faster form of lexsort"""
    return np.ascontiguousarray(arr).view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1]))).reshape(arr.shape[:-1])

def has_pairs_of_pairs(A):
    # optional; convert rows to base 3
    A = base3(A)
    # precompute sums over a lower triangular set of all combinations
    rowsums = sum(A[I] for I in np.tril_indices(n, -1))
    # count the number of times each row occurs by sorting
    # note that this is not quite O(n log n), since the cost of handling each row is also a function of n
    unique, count = unique_count(voidview(rowsums))
    # report whether any pairs of pairs exist;
    # computing their indices is left as an exercise for the reader
    return np.any(count > 1)

from time import clock
t = clock()
for i in xrange(100):
    print has_pairs_of_pairs(A)
print clock()-t
Edit: included base-3 packing; now n=2000 is feasible, taking about 2 GB of memory and a few seconds of processing.
Edit: added some timings; n=100 takes only 5ms per call on my i7m.
Based on the code in your question, and on the assumption that you're actually looking for pairs of pairs of rows that sum to equal the same row vector, you could do something like this:
def findMatchSets(A):
    B = A.transpose()
    pairs = tuple(combinations(range(len(A[0])), 2))
    matchSets = [[i for i in pairs if B[0][i[0]] + B[0][i[1]] == z] for z in range(3)]
    for c in range(1, len(A[0])):
        matchSets = [[i for i in block if B[c][i[0]] + B[c][i[1]] == z] for z in range(3) for block in matchSets]
        matchSets = [block for block in matchSets if len(block) > 1]
        if not matchSets:
            return []
    return matchSets
This basically stratifies the candidate pairs into equivalence sets that sum to the same value after one column has been taken into account, then two columns, then three, and so on, until it either reaches the last column or there is no equivalence set left with more than one member (i.e. there is no such pair of pairs). This will work fine for 100x100 arrays, largely because the chances of two pairs of rows summing to the same row vector are vanishingly small when n is large (n*(n-1)/2 combinations compared to 3^n possible vector sums).
UPDATE
Updated code to allow searching for pairs of n-size subsets of all rows, as requested. Default is n=2 as per the original question:
def findMatchSets(A, n=2):
    B = A.transpose()
    pairs = tuple(combinations(range(len(A[0])), n))
    matchSets = [[i for i in pairs if sum([B[0][i[j]] for j in range(n)]) == z] for z in range(n + 1)]
    for c in range(1, len(A[0])):
        matchSets = [[i for i in block if sum([B[c][i[j]] for j in range(n)]) == z] for z in range(n + 1) for block in matchSets]
        matchSets = [block for block in matchSets if len(block) > 1]
        if not matchSets:
            return []
    return matchSets
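A usage sketch; note that the n parameter here is the subset size, not the matrix dimension:

pairs = findMatchSets(A)           # pairs of row-pairs with equal sums
triples = findMatchSets(A, n=3)    # pairs of row-triples with equal sums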
Here is a 'lazy' approach that scales up to n=10000, using 'only' 4 GB of memory and completing in about 10s per call. Worst case complexity is O(n^3), but for random data the expected performance is O(n^2). At first sight it seems like you'd need O(n^3) ops, since each row combination needs to be produced and inspected at least once. But we need not look at the entire row: rather, we can perform an early-exit strategy on the comparison of row pairs once it is clear they are of no use to us, and for random data we may typically draw this conclusion long before we have considered all columns in a row.
import numpy as np

n = 10
# also works for non-square A
A = np.random.randint(2, size=(n*2, n)).astype(np.int8)
# force the inclusion of some hits, to keep our algorithm on its toes
##A[0] = A[1]

def base_pack_lazy(a, base, dtype=np.uint64):
    """
    pack the last axis of an array as minimal base representation
    lazily yields packed columns of the original matrix
    """
    a = np.ascontiguousarray(np.rollaxis(a, -1))
    init = np.zeros(a.shape[1:], dtype)
    packing = int(np.dtype(dtype).itemsize * 8 / (float(base) / 2))
    for columns in np.array_split(a, (len(a)-1)//packing+1):
        yield reduce(
            lambda acc, inc: acc*base + inc,
            columns,
            init)

def unique_count(a):
    """returns counts of unique elements"""
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), np.int)
    np.add.at(count, inverse, 1)  # note: this scatter operation requires numpy 1.8; use a sparse matrix otherwise!
    return unique, count, inverse

def has_identical_row_sums_lazy(A, combinations_index):
    """
    compute the existence of combinations of rows summing to the same vector,
    given an nxm matrix A and an index matrix specifying all combinations

    naively, we need to compute the sum of each row combination at least once, giving n^3 computations
    however, this isn't strictly required; we can lazily consider the columns, giving an early exit opportunity
    all nicely vectorized of course
    """
    multiplicity, combinations = combinations_index.shape
    # list of indices into combinations_index, denoting possibly interacting combinations
    active_combinations = np.arange(combinations, dtype=np.uint32)
    for packed_column in base_pack_lazy(A, base=multiplicity+1):  # loop over packed cols
        # compute rowsums only for a fixed number of columns at a time;
        # this is O(n^2) rather than O(n^3), and after considering the first column,
        # we can typically already exclude almost all rowpairs
        partial_rowsums = sum(packed_column[I[active_combinations]] for I in combinations_index)
        # find duplicates in this column
        unique, count, inverse = unique_count(partial_rowsums)
        # prune those pairs which we can exclude as having different sums, based on columns inspected thus far
        active_combinations = active_combinations[count[inverse] > 1]
        # early exit; no pairs
        if len(active_combinations) == 0:
            return False
    return True

def has_identical_triple_row_sums(A):
    n = len(A)
    idx = np.array([(i, j, k)
                    for i in xrange(n)
                    for j in xrange(n)
                    for k in xrange(n)
                    if i < j and j < k], dtype=np.uint16)
    idx = np.ascontiguousarray(idx.T)
    return has_identical_row_sums_lazy(A, idx)

def has_identical_double_row_sums(A):
    n = len(A)
    idx = np.array(np.tril_indices(n, -1), dtype=np.int32)
    return has_identical_row_sums_lazy(A, idx)

from time import clock
t = clock()
for i in xrange(10):
    print has_identical_double_row_sums(A)
    print has_identical_triple_row_sums(A)
print clock()-t
Extended to include the calculation over sums of triplets of rows, as you asked above. For n=100, this still takes only about 0.2s
Edit: some cleanup; edit2: some more cleanup
Your current code does not test for pairs of rows that sum to the same value.
Assuming that's actually what you want, it's best to stick to pure numpy. This generates the indices of all rows that have equal sum.
import numpy as np

n = 100
A = np.random.randint(2, size=(n, n))

rowsum = A.sum(axis=1)
unique, inverse = np.unique(rowsum, return_inverse=True)
count = np.zeros_like(unique)
np.add.at(count, inverse, 1)
for p in unique[count > 1]:
    print p, np.nonzero(rowsum == p)[0]
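On numpy 1.9 and later, np.unique can return the counts directly, which replaces the zeros/np.add.at bookkeeping above:

unique, counts = np.unique(rowsum, return_counts=True)
duplicated_sums = unique[counts > 1]  # row sums that occur more than once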
If all you need to do is determine whether such a pair exists you can do:
exists_unique = np.unique(A.sum(axis=1)).size != A.shape[0]  # True iff at least two rows share the same sum