Matrix row difference, output a boolean vector - python

I have an m x 3 matrix A and its row subset B (n x 3). Both are sets of indices into another, large 4D matrix; their data type is dtype('int64'). I would like to generate a boolean vector x, where x[i] = True if B does not contain row A[i,:].
There are no duplicate rows in either A or B.
I was wondering if there's an efficient way to do this in NumPy. I found an answer that's somewhat related: https://stackoverflow.com/a/11903368/265289; however, it returns the actual rows rather than a boolean vector.

You could follow the same pattern as shown in jterrace's answer, except use np.in1d instead of np.setdiff1d:
import numpy as np
np.random.seed(2015)
m, n = 10, 5
A = np.random.randint(10, size=(m,3))
B = A[np.random.choice(m, n, replace=False)]
print(A)
# [[2 2 9]
#  [6 8 5]
#  [7 8 0]
#  [6 7 8]
#  [3 8 6]
#  [9 2 3]
#  [1 2 6]
#  [2 9 8]
#  [5 8 4]
#  [8 9 1]]
print(B)
# [[2 2 9]
#  [1 2 6]
#  [2 9 8]
#  [3 8 6]
#  [9 2 3]]
def using_view(A, B, assume_unique=False):
    Ad = np.ascontiguousarray(A).view([('', A.dtype)] * A.shape[1])
    Bd = np.ascontiguousarray(B).view([('', B.dtype)] * B.shape[1])
    return ~np.in1d(Ad, Bd, assume_unique=assume_unique)
print(using_view(A, B, assume_unique=True))
yields
[False  True  True  True False False False False  True  True]
You can use assume_unique=True (which can speed up the calculation) since
there are no duplicate rows in A or B.
Beware that A.view(...) will raise
ValueError: new type not compatible with array.
if A.flags['C_CONTIGUOUS'] is False (i.e. if A is not a C-contiguous array).
Therefore, in general we need to use np.ascontiguousarray(A) before calling view.
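For instance, here is a minimal sketch of the failure mode (M is an illustrative array, not from the question):
M = np.arange(12).reshape(4, 3)[::2]    # slicing with a step yields a non-contiguous view
print(M.flags['C_CONTIGUOUS'])          # False
# M.view([('', M.dtype)] * M.shape[1])  # would raise the ValueError above
Md = np.ascontiguousarray(M).view([('', M.dtype)] * M.shape[1])  # fine: the copy is C-contiguous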
As B.M. suggests, you could instead view each row using the "void"
dtype:
def using_void(A, B):
    dtype = 'V{}'.format(A.dtype.itemsize * A.shape[-1])
    Ad = np.ascontiguousarray(A).view(dtype)
    Bd = np.ascontiguousarray(B).view(dtype)
    return ~np.in1d(Ad, Bd, assume_unique=True)
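With the same A and B as above, this returns the same boolean vector:
print(using_void(A, B))
# [False  True  True  True False False False False  True  True]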
This is safe to use with integer dtypes. However, note that
In [342]: np.array([-0.], dtype='float64').view('V8') == np.array([0.], dtype='float64').view('V8')
Out[342]: array([False], dtype=bool)
so using np.in1d after viewing as void may return incorrect results for arrays
with float dtype.
Here is a benchmark of some of the proposed methods:
import numpy as np
np.random.seed(2015)
m, n = 10000, 5000
# Note A may contain duplicate rows,
# so don't use assume_unique=True for these benchmarks.
# (In this case, using assume_unique=True would not improve the speed much anyway.)
A = np.random.randint(10, size=(2*m, 3))
# slice with a step to make A non-C-contiguous; this exercises the
# np.ascontiguousarray calls in the view-based methods
A = A[::2]
B = A[np.random.choice(m, n, replace=False)]
def using_view(A, B, assume_unique=False):
    Ad = np.ascontiguousarray(A).view([('', A.dtype)] * A.shape[1])
    Bd = np.ascontiguousarray(B).view([('', B.dtype)] * B.shape[1])
    return ~np.in1d(Ad, Bd, assume_unique=assume_unique)

from scipy.spatial import distance
def using_distance(A, B):
    return ~np.any(distance.cdist(A, B) == 0, 1)

from functools import reduce
def using_loop(A, B):
    pred = lambda i: A[:, i:i+1] == B[:, i]
    return ~reduce(np.logical_and, map(pred, range(A.shape[1]))).any(axis=1)

# private pandas internals; their location and names may differ across pandas versions
from pandas.core.groupby import get_group_index, _int64_overflow_possible
from functools import partial
def using_pandas(A, B):
    shape = [1 + max(A[:, i].max(), B[:, i].max()) for i in range(A.shape[1])]
    assert not _int64_overflow_possible(shape)
    encode = partial(get_group_index, shape=shape, sort=False, xnull=False)
    a1, b1 = map(encode, (A.T, B.T))
    return ~np.in1d(a1, b1)

def using_void(A, B):
    dtype = 'V{}'.format(A.dtype.itemsize * A.shape[-1])
    Ad = np.ascontiguousarray(A).view(dtype)
    Bd = np.ascontiguousarray(B).view(dtype)
    return ~np.in1d(Ad, Bd)

# Sanity check: make sure all the functions return the same result
for func in (using_distance, using_loop, using_pandas, using_void):
    assert (func(A, B) == using_view(A, B)).all()
In [384]: %timeit using_pandas(A, B)
100 loops, best of 3: 1.99 ms per loop
In [381]: %timeit using_void(A, B)
100 loops, best of 3: 6.72 ms per loop
In [378]: %timeit using_view(A, B)
10 loops, best of 3: 35.6 ms per loop
In [383]: %timeit using_loop(A, B)
1 loops, best of 3: 342 ms per loop
In [379]: %timeit using_distance(A, B)
1 loops, best of 3: 502 ms per loop

Since there are only 3 columns, one solution would be to just reduce across columns:
>>> a
array([[2, 2, 9],
       [6, 8, 5],
       [7, 8, 0],
       [6, 7, 8],
       [3, 8, 6],
       [9, 2, 3],
       [1, 2, 6],
       [2, 9, 8],
       [5, 8, 4],
       [8, 9, 1]])
>>> b
array([[2, 2, 9],
       [1, 2, 6],
       [2, 9, 8],
       [3, 8, 6],
       [9, 2, 3]])
>>> from functools import reduce
>>> pred = lambda i: a[:, i:i+1] == b[:, i]
>>> reduce(np.logical_and, map(pred, range(a.shape[1]))).any(axis=1)
array([ True, False, False, False,  True,  True,  True,  True, False, False], dtype=bool)
though this would create an m x n intermediate array, which may not be memory efficient. (Also note this gives the membership vector, i.e. True where a row of a is in b; negate it with ~ to get the vector the question asks for.)
Alternatively, if the values are indices, i.e. non-negative integers, you may use pandas.groupby.get_group_index to reduce to one-dimensional arrays. This is an efficient algorithm which pandas uses internally for groupby operations; the only caveat is that you may need to verify that there will not be any integer overflow:
>>> from pandas.core.groupby import get_group_index, _int64_overflow_possible
>>> from functools import partial
>>> shape = [1 + max(a[:, i].max(), b[:, i].max()) for i in range(a.shape[1])]
>>> assert not _int64_overflow_possible(shape)
>>> encode = partial(get_group_index, shape=shape, sort=False, xnull=False)
>>> a1, b1 = map(encode, (a.T, b.T))
>>> np.in1d(a1, b1)
array([ True, False, False, False, True, True, True, True, False, False], dtype=bool)
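If you prefer not to reach into pandas internals, here is a sketch of the same row-encoding idea using only NumPy; np.ravel_multi_index treats each row as coordinates into a virtual grid (it likewise requires non-negative integers and a shape that does not overflow):
>>> shape = [1 + max(a[:, i].max(), b[:, i].max()) for i in range(a.shape[1])]
>>> a1, b1 = (np.ravel_multi_index(m.T, shape) for m in (a, b))
>>> np.in1d(a1, b1)
array([ True, False, False, False,  True,  True,  True,  True, False, False], dtype=bool)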

You can treat A and B as two sets of XYZ points and calculate the Euclidean distances between them with scipy.spatial.distance.cdist; the zero distances are the ones of interest to us. cdist is supposed to be a pretty efficient implementation, so hopefully we would have an efficient solution for our case. So, the implementation to find such a boolean output would look like this -
from scipy.spatial import distance
out = ~np.any(distance.cdist(A,B)==0,1)
# OR np.all(distance.cdist(A,B)!=0,1)
Sample run -
In [582]: A
Out[582]:
array([[0, 2, 2],
       [1, 0, 3],
       [3, 3, 3],
       [2, 0, 3],
       [2, 0, 1],
       [1, 1, 1]])
In [583]: B
Out[583]:
array([[2, 0, 3],
       [2, 3, 3],
       [1, 1, 3],
       [2, 0, 1],
       [0, 2, 2],
       [2, 2, 2],
       [1, 2, 3]])
In [584]: out
Out[584]: array([False,  True,  True, False, False,  True], dtype=bool)

Related

iterating via np.nditer function for numpy arrays

I'm a bit new to Python and wanted to check some values in my arrays to see if they go above or below a certain value, and adjust them afterwards.
For the case of a 2d array with numpy I found this in some part of its manual:
import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
for x in np.nditer(arr[:, ::2]):
    print(x)
What's the syntax in Python to change the starting value, so it doesn't iterate over every value from the first but from one I can define (such as every 2nd or 3rd)? I need to check every 1st, 2nd, 3rd and so on value in my arrays against a different value. Or is there maybe a better way to do this?
I suspect you need to read some more basic numpy docs.
You created a 2d array:
In [5]: arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
In [6]: arr
Out[6]:
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
You can view it as a 1d array (nditer iterates as flat)
In [7]: arr.ravel()
Out[7]: array([1, 2, 3, 4, 5, 6, 7, 8])
You can use standard slice notation to select every other element of the flattened array:
In [8]: arr.ravel()[::2]
Out[8]: array([1, 3, 5, 7])
or every other column of the original:
In [9]: arr[:, ::2]
Out[9]:
array([[1, 3],
       [5, 7]])
You can test every value, such as for being odd:
In [10]: arr % 2 == 1
Out[10]:
array([[ True, False,  True, False],
       [ True, False,  True, False]])
and use that array to select those values:
In [11]: arr[arr % 2 == 1]
Out[11]: array([1, 3, 5, 7])
or modify them:
In [12]: arr[arr % 2 == 1] += 10
In [13]: arr
Out[13]:
array([[11,  2, 13,  4],
       [15,  6, 17,  8]])
The documentation for nditer tends to overhype it. It's useful in compiled code, but rarely useful in python. If the above whole-array methods don't work, you can iterate directly, with more control and understanding:
In [14]: for row in arr:
    ...:     print(row)
    ...:     for x in row:
    ...:         print(x)
[11 2 13 4]
11
2
13
4
[15 6 17 8]
15
6
17
8

Check if 2d array exists in 3d array in Python?

I have a 3d array with shape (1000, 12, 30), and I have a list of 2d arrays of shape (12, 30). What I want to do is check if these 2d arrays exist in the 3d array. Is there a simple way in Python to do this? I tried the keyword in, but it doesn't work.
There is a way in numpy: you can do it with np.all:
a = np.random.rand(3, 1, 2)
b = a[1][0]
np.all(np.all(a == b, 1), 1)
Out[612]: array([False, True, False])
A cleaner solution, from bnaecker:
np.all(a == b, axis=(1, 2))
If you only want to check whether b exists in a at all:
np.any(np.all(a == b, axis=(1, 2)))
Here is a fast method (previously used by @DanielF as well as @Jaime and others, no doubt) that uses a trick to benefit from short-circuiting: view-cast template-sized blocks to single elements of dtype void. When comparing two such blocks numpy stops after the first difference, yielding a huge speed advantage.
>>> def in_(data, template):
...     dv = data.reshape(data.shape[0], -1).view(f'V{data.dtype.itemsize*np.prod(data.shape[1:])}').ravel()
...     tv = template.ravel().view(f'V{template.dtype.itemsize*template.size}').reshape(())
...     return (dv==tv).any()
Example:
>>> a = np.random.randint(0, 100, (1000, 12, 30))
>>> check = a[np.random.randint(0, 1000, (10,))]
>>> check += np.random.random(check.shape) < 0.001
>>>
>>> [in_(a, c) for c in check]
[True, True, True, False, False, True, True, True, True, False]
# compare to other method
>>> (a==check[:, None]).all((-1,-2)).any(-1)
array([ True,  True,  True, False, False,  True,  True,  True,  True,
       False])
Gives same result as "direct" numpy approach, but is almost 20x faster:
>>> from timeit import timeit
>>> kwds = dict(globals=globals(), number=100)
>>>
>>> timeit("(a==check[:, None]).all((-1,-2)).any(-1)", **kwds)
0.4793281531892717
>>> timeit("[in_(a, c) for c in check]", **kwds)
0.026218891143798828
Numpy
Given
a = np.arange(12).reshape(3, 2, 2)
lst = [
    np.arange(4).reshape(2, 2),
    np.arange(4, 8).reshape(2, 2)
]
print(a, *lst, sep='\n{}\n'.format('-' * 20))
[[[ 0  1]
  [ 2  3]]

 [[ 4  5]
  [ 6  7]]

 [[ 8  9]
  [10 11]]]
--------------------
[[0 1]
 [2 3]]
--------------------
[[4 5]
 [6 7]]
Notice that lst is a list of arrays, as per the OP. I'll make that a 3d array b below.
Use broadcasting: per the broadcasting rules, I want the dimensions of a to be (1, 3, 2, 2) and those of b to be (2, 1, 2, 2).
b = np.array(lst)
x, *y = b.shape
c = np.equal(
    a.reshape(1, *a.shape),
    np.array(lst).reshape(x, 1, *y)
)
I'll use all to produce a (2, 3) array of truth values and np.where to find out which among the a and b sub-arrays are actually equal.
i, j = np.where(c.all((-2, -1)))
This is just a verification that we achieved what we were after: for each paired i and j value, the sub-arrays are actually the same.
for t in zip(i, j):
    print(a[t[0]], b[t[1]], sep='\n\n')
    print('------')
[[0 1]
 [2 3]]

[[0 1]
 [2 3]]
------
[[4 5]
 [6 7]]

[[4 5]
 [6 7]]
------
in
However, to complete OP's thought on using in
a_ = a.tolist()
list(filter(lambda x: x.tolist() in a_, lst))
[array([[0, 1],
       [2, 3]]), array([[4, 5],
       [6, 7]])]

Numpy Vectorization While Indexing Two Arrays

I'm trying to vectorize the following function using numpy and am completely lost.
A = ndarray: Z x 3
B = ndarray: Z x 3
C = integer
D = ndarray: C x 3
Pseudocode:
entries = []
means = []
for i in range(C):
    for p in range(len(B)):
        if B[p] == D[i]:
            entries.append(A[p])
    means.append(columnwise_means(entries))
return means
An example would be:
A = [[1,2,3],[1,2,3],[4,5,6],[4,5,6]]
B = [[9,8,7],[7,6,5],[1,2,3],[3,4,5]]
C = 2
D = [[1,2,3],[4,5,6]]
Returns:
[average([9,8,7], [7,6,5]), average([1,2,3], [3,4,5])] = [[8,7,6], [2,3,4]]
I've tried using np.where, np.argwhere, np.mean, etc., but can't seem to get the desired effect. Any help would be greatly appreciated.
Thanks!
Going by the expected output of the question, I am assuming that in the actual code, you would have: the IF conditional statement as if A[p] == D[i], and entries would be appended from B: entries.append(B[p]).
So, here's one vectorized approach with NumPy broadcasting and dot-product -
mask = (D[:,None,:] == A).all(-1)
out = mask.dot(B)/(mask.sum(1)[:,None])
If the input arrays are integer arrays, then you can save on memory and boost performance by treating the arrays as indices into an n-dimensional array, and thus create the 2D mask without going 3D, like so -
dims = np.maximum(A.max(0),D.max(0))+1
mask = np.ravel_multi_index(D.T,dims)[:,None] == np.ravel_multi_index(A.T,dims)
Sample run -
In [107]: A
Out[107]:
array([[1, 2, 3],
       [1, 2, 3],
       [4, 5, 6],
       [4, 5, 6]])
In [108]: B
Out[108]:
array([[9, 8, 7],
       [7, 6, 5],
       [1, 2, 3],
       [3, 4, 5]])
In [109]: mask = (D[:,None,:] == A).all(-1)
...: out = mask.dot(B)/(mask.sum(1)[:,None])
...:
In [110]: out
Out[110]:
array([[8, 7, 6],
[2, 3, 4]])
I see two hints:
First, comparing arrays by rows. A way to do that is to collapse your index system to 1D:
def indexer(M, base=256):
    return (M * base**np.arange(3)).sum(axis=1)
base is an integer > A.max(). Then the selection can be done like that:
indices = np.equal.outer(indexer(D), indexer(A))
which gives:
array([[ True,  True, False, False],
       [False, False,  True,  True]], dtype=bool)
Second, each group can have a different length, so vectorisation is difficult for the last step. Here is a way to achieve the job:
B = np.array(B)
means = [B[i].mean(axis=0) for i in indices]
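Putting the two hints together on the question's sample data (a sketch; base=256 assumes every value is below 256):
import numpy as np
A = np.array([[1, 2, 3], [1, 2, 3], [4, 5, 6], [4, 5, 6]])
B = np.array([[9, 8, 7], [7, 6, 5], [1, 2, 3], [3, 4, 5]])
D = np.array([[1, 2, 3], [4, 5, 6]])
def indexer(M, base=256):
    # collapse each 3-column row into a single integer key
    return (M * base**np.arange(3)).sum(axis=1)
indices = np.equal.outer(indexer(D), indexer(A))   # boolean (2, 4) group mask
means = [B[i].mean(axis=0) for i in indices]       # one column-wise mean per group
print(means)
# [array([ 8.,  7.,  6.]), array([ 2.,  3.,  4.])]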

NumPy selecting specific column index per row by using a list of indexes

I'm struggling to select the specific columns per row of a NumPy matrix.
Suppose I have the following matrix which I would call X:
[1, 2, 3]
[4, 5, 6]
[7, 8, 9]
I also have a list of column indexes per every row which I would call Y:
[1, 0, 2]
I need to get the values:
[2]
[4]
[9]
Instead of a list with indexes Y, I can also produce a matrix with the same shape as X, where every cell is a bool / an int in the 0-1 range, indicating whether this is the required column.
[0, 1, 0]
[1, 0, 0]
[0, 0, 1]
I know this can be done by iterating over the array and selecting the column values I need. However, this will be executed frequently on big arrays of data, and that's why it has to run as fast as it can.
I was thus wondering if there is a better solution?
If you've got a boolean array you can do direct selection based on that like so:
>>> a = np.array([True, True, True, False, False])
>>> b = np.array([1,2,3,4,5])
>>> b[a]
array([1, 2, 3])
To go along with your initial example you could do the following:
>>> a = np.array([[1,2,3], [4,5,6], [7,8,9]])
>>> b = np.array([[False,True,False],[True,False,False],[False,False,True]])
>>> a[b]
array([2, 4, 9])
You can also add in an arange and do direct selection on that, though depending on how you're generating your boolean array and what your code looks like YMMV.
>>> a = np.array([[1,2,3], [4,5,6], [7,8,9]])
>>> a[np.arange(len(a)), [1,0,2]]
array([2, 4, 9])
You can do something like this:
In [7]: a = np.array([[1, 2, 3],
...: [4, 5, 6],
...: [7, 8, 9]])
In [8]: lst = [1, 0, 2]
In [9]: a[np.arange(len(a)), lst]
Out[9]: array([2, 4, 9])
More on indexing multi-dimensional arrays: http://docs.scipy.org/doc/numpy/user/basics.indexing.html#indexing-multi-dimensional-arrays
Recent numpy versions have added a take_along_axis (and put_along_axis) that does this indexing cleanly.
In [101]: a = np.arange(1,10).reshape(3,3)
In [102]: b = np.array([1,0,2])
In [103]: np.take_along_axis(a, b[:,None], axis=1)
Out[103]:
array([[2],
       [4],
       [9]])
It operates in the same way as:
In [104]: a[np.arange(3), b]
Out[104]: array([2, 4, 9])
but with different axis handling. It's especially aimed at applying the results of argsort and argmax.
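For instance (a small illustration, not part of the original answer), pairing it with argsort sorts each row in descending order:
In [105]: idx = np.argsort(a, axis=1)[:, ::-1]
In [106]: np.take_along_axis(a, idx, axis=1)
Out[106]:
array([[3, 2, 1],
       [6, 5, 4],
       [9, 8, 7]])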
A simple way might look like:
In [1]: a = np.array([[1, 2, 3],
...: [4, 5, 6],
...: [7, 8, 9]])
In [2]: y = [1, 0, 2] #list of indices we want to select from matrix 'a'
range(a.shape[0]) will generate the row indices 0, 1, 2 (equivalent to np.arange(a.shape[0]))
In [3]: a[range(a.shape[0]), y] #we're selecting y indices from every row
Out[3]: array([2, 4, 9])
You can do it by using an iterator, like this:
np.fromiter((row[index] for row, index in zip(X, Y)), dtype=int)
Time:
N = 1000
X = np.zeros(shape=(N, N))
Y = np.arange(N)
# @Aशwini चhaudhary
%timeit X[np.arange(len(X)), Y]
10000 loops, best of 3: 30.7 us per loop
#mine
%timeit np.fromiter((row[index] for row, index in zip(X, Y)), dtype=int)
1000 loops, best of 3: 1.15 ms per loop
#mine
%timeit np.diag(X.T[Y])
10 loops, best of 3: 20.8 ms per loop
Another clever way is to first transpose the array, index it with Y, and then take the diagonal; it's always the right answer.
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
Y = np.array([1, 0, 2, 2])
np.diag(X.T[Y])
Step by step:
Original arrays:
>>> X
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])
>>> Y
array([1, 0, 2, 2])
Transpose to make it possible to index it right.
>>> X.T
array([[ 1,  4,  7, 10],
       [ 2,  5,  8, 11],
       [ 3,  6,  9, 12]])
Get rows in the Y order.
>>> X.T[Y]
array([[ 2,  5,  8, 11],
       [ 1,  4,  7, 10],
       [ 3,  6,  9, 12],
       [ 3,  6,  9, 12]])
The diagonal should now become clear.
>>> np.diag(X.T[Y])
array([ 2,  4,  9, 12])
The answer from hpaulj using take_along_axis should be the accepted one.
Here is a derived version with an N-dim index array:
>>> arr = np.arange(20).reshape((2,2,5))
>>> idx = np.array([[1,0],[2,4]])
>>> np.take_along_axis(arr, idx[...,None], axis=-1)
array([[[ 1],
        [ 5]],

       [[12],
        [19]]])
Note that the selection operation is ignorant about the shapes. I used this to refine a possibly vector-valued argmax result from histogram by fitting parabolas:
def interpol(arr):
    i = np.argmax(arr, axis=-1)
    a = lambda Δ: np.squeeze(np.take_along_axis(arr, i[..., None]+Δ, axis=-1), axis=-1)
    frac = .5*(a(1) - a(-1)) / (2*a(0) - a(-1) - a(1))  # |frac| < 0.5
    return i + frac
Note the squeeze that removes the dimension of size 1, so that i and frac, the integer and fractional parts of the peak position, end up with the same shape.
I'm quite sure that it is possible to avoid the lambda, but would the interpolation formula still look nice?
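One lambda-free possibility (a sketch; like the original it assumes the argmax never lands on the array edge):
def interpol_no_lambda(arr):
    i = np.argmax(arr, axis=-1)
    # gather the peak and its two neighbours in one call, then split them
    nbrs = np.take_along_axis(arr, i[..., None] + np.array([-1, 0, 1]), axis=-1)
    left, mid, right = nbrs[..., 0], nbrs[..., 1], nbrs[..., 2]
    frac = .5 * (right - left) / (2 * mid - left - right)  # |frac| < 0.5
    return i + frac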

Numpy broadcast array

I have the following array in NumPy:
A = array([1, 2, 3])
How can I obtain the following matrices (without an explicit loop)?
B = [ 1 1 1
      2 2 2
      3 3 3 ]
C = [ 1 2 3
      1 2 3
      1 2 3 ]
Thanks!
Edit2: The OP asks in the comments how to compute
n(i, j) = l(i, i) + l(j, j) - 2 * l(i, j)
I can think of two ways. I like this way because it generalizes easily:
import numpy as np
l = np.arange(9).reshape(3, 3)
print(l)
# [[0 1 2]
#  [3 4 5]
#  [6 7 8]]
The idea is to use np.ogrid. This defines a list of two numpy arrays, one of shape (3,1) and one of shape (1,3):
grid = np.ogrid[0:3, 0:3]
print(grid)
# [array([[0],
#         [1],
#         [2]]), array([[0, 1, 2]])]
grid[0] can be used as a proxy for the index i, and
grid[1] can be used as a proxy for the index j.
So everywhere in the expression l(i, i) + l(j, j) - 2 * l(i, j), you simply replace i-->grid[0], and j-->grid[1], and numpy broadcasting takes care of the rest:
n = l[grid[0], grid[0]] + l[grid[1], grid[1]] - 2*l
print(n)
# [[ 0  2  4]
#  [-2  0  2]
#  [-4 -2  0]]
However, in this particular case, since l(i,i) and l(j,j) are just the diagonal elements of l, you could do this instead:
d=np.diag(l)
print(d)
# [0 4 8]
d[np.newaxis,:] pumps up the shape of d to (1,3), and
d[:,np.newaxis] pumps up the shape of d to (3,1).
Numpy broadcasting pumps up d[np.newaxis,:] and d[:,np.newaxis] to shape (3,3), copying values as appropriate.
n = d[np.newaxis, :] + d[:, np.newaxis] - 2*l
print(n)
# [[ 0  2  4]
#  [-2  0  2]
#  [-4 -2  0]]
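As a compact equivalent (a small addition, not in the original answer), np.add.outer builds the same d(i,i) + d(j,j) table in one call:
n = np.add.outer(d, d) - 2*l
print(n)
# [[ 0  2  4]
#  [-2  0  2]
#  [-4 -2  0]]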
Edit1: Usually you do not need to form B or C. The purpose of Numpy broadcasting is to allow you to use A in place of B or C. If you show us how you plan to use B or C, we might be able to show you how to do the same with A and numpy broadcasting.
(Original answer):
In [11]: B=A.repeat(3).reshape(3,3)
In [12]: B
Out[12]:
array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 3]])
In [13]: C=B.T
In [14]: C
Out[14]:
array([[1, 2, 3],
       [1, 2, 3],
       [1, 2, 3]])
or
In [25]: C=np.tile(A,(3,1))
In [26]: C
Out[26]:
array([[1, 2, 3],
       [1, 2, 3],
       [1, 2, 3]])
In [27]: B=C.T
In [28]: B
Out[28]:
array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 3]])
From the dirty tricks department (note the hard-coded stride of 4 assumes a 4-byte dtype; in general use A.itemsize):
In [57]: np.lib.stride_tricks.as_strided(A, shape=(3,3), strides=(4,0))
Out[57]:
array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 3]])
In [58]: np.lib.stride_tricks.as_strided(A, shape=(3,3), strides=(0,4))
Out[58]:
array([[1, 2, 3],
       [1, 2, 3],
       [1, 2, 3]])
But note that these are views of A, not copies (as were the solutions above). Changing B alters A:
In [59]: B=np.lib.stride_tricks.as_strided(A,shape=(3,3),strides=(4,0))
In [60]: B[0,0]=100
In [61]: A
Out[61]: array([100, 2, 3])
Very old thread but just in case someone cares...
C,B = np.meshgrid(A,A)
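On modern NumPy you can also use np.broadcast_to (a small addition in the spirit of the stride-tricks approach; it likewise returns read-only views of A rather than copies):
B = np.broadcast_to(A[:, None], (3, 3))   # [[1 1 1], [2 2 2], [3 3 3]], a view of A
C = np.broadcast_to(A, (3, 3))            # [[1 2 3], [1 2 3], [1 2 3]], also a view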
