I have an m x 3 matrix A and its row subset B (n x 3). Both are sets of indices into another, large 4D matrix; their data type is dtype('int64'). I would like to generate a boolean vector x, where x[i] = True if B does not contain row A[i,:].
There are no duplicate rows in either A or B.
I was wondering if there's an efficient way to do this in NumPy? I found an answer that's somewhat related: https://stackoverflow.com/a/11903368/265289; however, it returns the actual rows (not a boolean vector).
You could follow the same pattern as shown in jterrace's answer, except use np.in1d instead of np.setdiff1d:
import numpy as np
np.random.seed(2015)
m, n = 10, 5
A = np.random.randint(10, size=(m,3))
B = A[np.random.choice(m, n, replace=False)]
print(A)
# [[2 2 9]
# [6 8 5]
# [7 8 0]
# [6 7 8]
# [3 8 6]
# [9 2 3]
# [1 2 6]
# [2 9 8]
# [5 8 4]
# [8 9 1]]
print(B)
# [[2 2 9]
# [1 2 6]
# [2 9 8]
# [3 8 6]
# [9 2 3]]
def using_view(A, B, assume_unique=False):
    Ad = np.ascontiguousarray(A).view([('', A.dtype)] * A.shape[1])
    Bd = np.ascontiguousarray(B).view([('', B.dtype)] * B.shape[1])
    return ~np.in1d(Ad, Bd, assume_unique=assume_unique)
print(using_view(A, B, assume_unique=True))
yields
[False True True True False False False False True True]
You can use assume_unique=True (which can speed up the calculation) since
there are no duplicate rows in A or B.
Beware that A.view(...) will raise
ValueError: new type not compatible with array.
if A.flags['C_CONTIGUOUS'] is False (i.e. if A is not a C-contiguous array).
Therefore, in general we need to call np.ascontiguousarray(A) before calling view.
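A quick way to check and fix contiguity before taking the view (a minimal sketch):
import numpy as np

A = np.random.randint(10, size=(10, 6))[:, ::2]   # slicing drops C-contiguity
print(A.flags['C_CONTIGUOUS'])                    # False -> A.view(...) would raise
Ac = np.ascontiguousarray(A)                      # makes a C-ordered copy only when needed
print(Ac.flags['C_CONTIGUOUS'])                   # True -> safe to call .view(...)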
As B.M. suggests, you could instead view each row using the "void"
dtype:
def using_void(A, B):
    dtype = 'V{}'.format(A.dtype.itemsize * A.shape[-1])
    Ad = np.ascontiguousarray(A).view(dtype)
    Bd = np.ascontiguousarray(B).view(dtype)
    return ~np.in1d(Ad, Bd, assume_unique=True)
This is safe to use with integer dtypes. However, note that
In [342]: np.array([-0.], dtype='float64').view('V8') == np.array([0.], dtype='float64').view('V8')
Out[342]: array([False], dtype=bool)
so using np.in1d after viewing as void may return incorrect results for arrays
with float dtype.
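If you do need the void view with floats, one possible guard (my sketch, not part of the original answer) is to normalize signed zeros first, since -0.0 + 0.0 == +0.0; note that NaNs would still compare bitwise rather than by float semantics:
import numpy as np

a = np.array([-0.0, 1.5])
b = np.array([0.0, 1.5])
# adding 0.0 turns -0.0 into +0.0, so the byte patterns match
print((a + 0.0).view('V8') == (b + 0.0).view('V8'))  # [ True  True]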
Here is a benchmark of some of the proposed methods:
import numpy as np
np.random.seed(2015)
m, n = 10000, 5000
# Note A may contain duplicate rows,
# so don't use assume_unique=True for these benchmarks.
# In this case, using assume_unique=False does not improve the speed much anyway.
A = np.random.randint(10, size=(2*m,3))
# make A non-contiguous; a plain .view would fail on it, hence the ascontiguousarray calls
A = A[::2]
B = A[np.random.choice(m, n, replace=False)]
def using_view(A, B, assume_unique=False):
    Ad = np.ascontiguousarray(A).view([('', A.dtype)] * A.shape[1])
    Bd = np.ascontiguousarray(B).view([('', B.dtype)] * B.shape[1])
    return ~np.in1d(Ad, Bd, assume_unique=assume_unique)
from scipy.spatial import distance
def using_distance(A, B):
return ~np.any(distance.cdist(A,B)==0,1)
from functools import reduce
def using_loop(A, B):
    pred = lambda i: A[:, i:i+1] == B[:, i]
    return ~reduce(np.logical_and, map(pred, range(A.shape[1]))).any(axis=1)
from pandas.core.groupby import get_group_index, _int64_overflow_possible  # note: newer pandas moved these helpers to pandas.core.sorting
from functools import partial
def using_pandas(A, B):
    shape = [1 + max(A[:, i].max(), B[:, i].max()) for i in range(A.shape[1])]
    assert not _int64_overflow_possible(shape)
    encode = partial(get_group_index, shape=shape, sort=False, xnull=False)
    a1, b1 = map(encode, (A.T, B.T))
    return ~np.in1d(a1, b1)
def using_void(A, B):
    dtype = 'V{}'.format(A.dtype.itemsize * A.shape[-1])
    Ad = np.ascontiguousarray(A).view(dtype)
    Bd = np.ascontiguousarray(B).view(dtype)
    return ~np.in1d(Ad, Bd)
# Sanity check: make sure all the functions return the same result
for func in (using_distance, using_loop, using_pandas, using_void):
    assert (func(A, B) == using_view(A, B)).all()
In [384]: %timeit using_pandas(A, B)
100 loops, best of 3: 1.99 ms per loop
In [381]: %timeit using_void(A, B)
100 loops, best of 3: 6.72 ms per loop
In [378]: %timeit using_view(A, B)
10 loops, best of 3: 35.6 ms per loop
In [383]: %timeit using_loop(A, B)
1 loops, best of 3: 342 ms per loop
In [379]: %timeit using_distance(A, B)
1 loops, best of 3: 502 ms per loop
Since there are only 3 columns, one solution would be to just reduce across columns:
>>> a
array([[2, 2, 9],
[6, 8, 5],
[7, 8, 0],
[6, 7, 8],
[3, 8, 6],
[9, 2, 3],
[1, 2, 6],
[2, 9, 8],
[5, 8, 4],
[8, 9, 1]])
>>> b
array([[2, 2, 9],
[1, 2, 6],
[2, 9, 8],
[3, 8, 6],
[9, 2, 3]])
>>> from functools import reduce
>>> pred = lambda i: a[:, i:i+1] == b[:,i]
>>> reduce(np.logical_and, map(pred, range(a.shape[1]))).any(axis=1)
array([ True, False, False, False, True, True, True, True, False, False], dtype=bool)
though this would create an m x n intermediate array which may not be memory efficient.
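If that intermediate is too large, a hedged variant (not from the original answer) compares against B in row chunks, keeping the temporary boolean block bounded; chunk is an arbitrary tuning knob:
import numpy as np

def not_in_rows_chunked(A, B, chunk=1024):
    # True where the row of A does not appear in B; peak temporary is len(A) x chunk x 3
    found = np.zeros(len(A), dtype=bool)
    for start in range(0, len(B), chunk):
        Bc = B[start:start + chunk]
        found |= (A[:, None, :] == Bc[None, :, :]).all(-1).any(-1)
    return ~found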
Alternatively, if the values are indices, i.e. non-negative integers, you can use pandas.core.groupby.get_group_index to reduce to one-dimensional arrays. This is an efficient algorithm that pandas uses internally for groupby operations; the only caveat is that you may need to verify that there will not be any integer overflow:
>>> from pandas.core.groupby import get_group_index, _int64_overflow_possible
>>> from functools import partial
>>> shape = [1 + max(a[:, i].max(), b[:, i].max()) for i in range(a.shape[1])]
>>> assert not _int64_overflow_possible(shape)
>>> encode = partial(get_group_index, shape=shape, sort=False, xnull=False)
>>> a1, b1 = map(encode, (a.T, b.T))
>>> np.in1d(a1, b1)
array([ True, False, False, False, True, True, True, True, False, False], dtype=bool)
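Since the OP wants x[i] = True when B does not contain row A[i], just negate the membership mask:
>>> ~np.in1d(a1, b1)
array([False,  True,  True,  True, False, False, False, False,  True,  True], dtype=bool)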
You can treat A and B as two sets of XYZ points and calculate the Euclidean distances between them with scipy.spatial.distance.cdist. The zero distances are the ones of interest to us: a row of A at distance zero from some row of B is an exact match. Since cdist is an efficient implementation, this should give us a fast solution. So, the implementation to find such a boolean output would look like this -
from scipy.spatial import distance
out = ~np.any(distance.cdist(A,B)==0,1)
# OR np.all(distance.cdist(A,B)!=0,1)
Sample run -
In [582]: A
Out[582]:
array([[0, 2, 2],
[1, 0, 3],
[3, 3, 3],
[2, 0, 3],
[2, 0, 1],
[1, 1, 1]])
In [583]: B
Out[583]:
array([[2, 0, 3],
[2, 3, 3],
[1, 1, 3],
[2, 0, 1],
[0, 2, 2],
[2, 2, 2],
[1, 2, 3]])
In [584]: out
Out[584]: array([False, True, True, False, False, True], dtype=bool)
I'm a bit new to Python and want to check some values in my arrays to see if they go above or below a certain value, and adjust them afterwards.
For the case of a 2d array with NumPy, I found this in its manual:
import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
for x in np.nditer(arr[:, ::2]):
    print(x)
What's the syntax in Python to change that starting point, so it doesn't iterate over every value from the first, but from one I can define (e.g. from every 2nd or 3rd element)? I need to check every 1st, 2nd, 3rd, and so on value in my arrays against a different value. Or is there maybe a better way to do this?
I suspect you need to read some more basic numpy docs.
You created a 2d array:
In [5]: arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
In [6]: arr
Out[6]:
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
You can view it as a 1d array (nditer iterates over it in flat order):
In [7]: arr.ravel()
Out[7]: array([1, 2, 3, 4, 5, 6, 7, 8])
You can use standard slice notation to select every other element of the flattened array:
In [8]: arr.ravel()[::2]
Out[8]: array([1, 3, 5, 7])
or every other column of the original:
In [9]: arr[:,::2]
Out[9]:
array([[1, 3],
[5, 7]])
You can test every value, such as for being odd:
In [10]: arr % 2 == 1
Out[10]:
array([[ True, False, True, False],
[ True, False, True, False]])
and use that array to select those values:
In [11]: arr[arr % 2 == 1]
Out[11]: array([1, 3, 5, 7])
or modify them:
In [12]: arr[arr % 2 == 1] += 10
In [13]: arr
Out[13]:
array([[11, 2, 13, 4],
[15, 6, 17, 8]])
The documentation for nditer tends to overhype it. It's useful in compiled code, but rarely needed in Python. If the above whole-array methods don't work, you can iterate directly, with more control and understanding:
In [14]: for row in arr:
    ...:     print(row)
    ...:     for x in row:
    ...:         print(x)
[11 2 13 4]
11
2
13
4
[15 6 17 8]
15
6
17
8
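For the question's actual goal — checking values against a limit and adjusting them — boolean masks or np.clip do it without any explicit loop. A small sketch (the thresholds here are arbitrary placeholders):
In [15]: x = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

In [16]: np.clip(x, 2, 6)        # limit all values to the range [2, 6]
Out[16]:
array([[2, 2, 3, 4],
       [5, 6, 6, 6]])

In [17]: sub = x[:, ::2]         # every other column, as in the question

In [18]: sub[sub > 4] = 4        # the slice is a view, so this modifies x in place

In [19]: x
Out[19]:
array([[1, 2, 3, 4],
       [4, 6, 4, 8]])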
I have a 3d array with shape (1000, 12, 30), and a list of 2d arrays of shape (12, 30). What I want to do is check whether these 2d arrays exist in the 3d array. Is there a simple way in Python to do this? I tried the in keyword, but it doesn't work.
There is a way in numpy: you can do it with np.all:
a = np.random.rand(3, 1, 2)
b = a[1][0]
np.all(np.all(a == b, 1), 1)
Out[612]: array([False, True, False])
A cleaner solution from bnaecker:
np.all(a == b, axis=(1, 2))
If you only want to check whether it exists or not:
np.any(np.all(a == b, axis=(1, 2)))
Here is a fast method (previously used by @DanielF as well as @jaime and others, no doubt) that uses a trick to benefit from short-circuiting: view-cast template-sized blocks to single elements of dtype void. When comparing two such blocks, numpy stops after the first difference, yielding a huge speed advantage.
>>> def in_(data, template):
...     dv = data.reshape(data.shape[0], -1).view(f'V{data.dtype.itemsize*np.prod(data.shape[1:])}').ravel()
...     tv = template.ravel().view(f'V{template.dtype.itemsize*template.size}').reshape(())
...     return (dv==tv).any()
Example:
>>> a = np.random.randint(0, 100, (1000, 12, 30))
>>> check = a[np.random.randint(0, 1000, (10,))]
>>> check += np.random.random(check.shape) < 0.001
>>>
>>> [in_(a, c) for c in check]
[True, True, True, False, False, True, True, True, True, False]
# compare to other method
>>> (a==check[:, None]).all((-1,-2)).any(-1)
array([ True, True, True, False, False, True, True, True, True,
False])
Gives same result as "direct" numpy approach, but is almost 20x faster:
>>> from timeit import timeit
>>> kwds = dict(globals=globals(), number=100)
>>>
>>> timeit("(a==check[:, None]).all((-1,-2)).any(-1)", **kwds)
0.4793281531892717
>>> timeit("[in_(a, c) for c in check]", **kwds)
0.026218891143798828
Numpy
Given
a = np.arange(12).reshape(3, 2, 2)
lst = [
np.arange(4).reshape(2, 2),
np.arange(4, 8).reshape(2, 2)
]
print(a, *lst, sep='\n{}\n'.format('-' * 20))
[[[ 0 1]
[ 2 3]]
[[ 4 5]
[ 6 7]]
[[ 8 9]
[10 11]]]
--------------------
[[0 1]
[2 3]]
--------------------
[[4 5]
[6 7]]
Notice that lst is a list of arrays as per OP. I'll make that a 3d array b below.
Use broadcasting. Per the broadcasting rules, I want the dimensions of a to be (1, 3, 2, 2) and b to be (2, 1, 2, 2).
b = np.array(lst)
x, *y = b.shape
c = np.equal(
a.reshape(1, *a.shape),
np.array(lst).reshape(x, 1, *y)
)
I'll use all to produce a (2, 3) array of truth values and np.where to find out which among the a and b sub-arrays are actually equal.
i, j = np.where(c.all((-2, -1)))
This is just a verification that we achieved what we were after. We are supposed to observe that for each paired i and j values, the sub-arrays are actually the same.
for t in zip(i, j):
    print(a[t[0]], b[t[1]], sep='\n\n')
    print('------')
[[0 1]
[2 3]]
[[0 1]
[2 3]]
------
[[4 5]
[6 7]]
[[4 5]
[6 7]]
------
in
However, to complete OP's thought on using in
a_ = a.tolist()
list(filter(lambda x: x.tolist() in a_, lst))
[array([[0, 1],
[2, 3]]), array([[4, 5],
[6, 7]])]
I'm trying to vectorize the following function using numpy and am completely lost.
A = ndarray: Z x 3
B = ndarray: Z x 3
C = integer
D = ndarray: C x 3
Pseudocode:
entries = []
means = []
for i in range(C):
    for p in range(len(B)):
        if B[p] == D[i]:
            entries.append(A[p])
    means.append(columnwise_means(entries))
return means
An example would be:
A = [[1,2,3],[1,2,3],[4,5,6],[4,5,6]]
B = [[9,8,7],[7,6,5],[1,2,3],[3,4,5]]
C = 2
D = [[1,2,3],[4,5,6]]
Returns:
[average([9,8,7],[7,6,5]), average([1,2,3],[3,4,5])] = [[8,7,6],[2,3,4]]
I've tried using np.where, np.argwhere, np.mean, etc but can't seem to get the desired effect. Any help would be greatly appreciated.
Thanks!
Going by the expected output of the question, I am assuming that in the actual code, you would have:
the IF conditional statement as: if A[p] == D[i], and
entries appended from B: entries.append(B[p]).
So, here's one vectorized approach with NumPy broadcasting and dot-product -
mask = (D[:,None,:] == A).all(-1)
out = mask.dot(B)/(mask.sum(1)[:,None])
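To see why this works: with True counted as 1, mask.dot(B) sums the selected rows of B for each row of D, and mask.sum(1) counts them, so the ratio is the per-group mean. Traced on the sample data below:
# mask        -> [[ True,  True, False, False],    rows of A equal to D[0]
#                 [False, False,  True,  True]]    rows of A equal to D[1]
# mask.dot(B) -> [[16, 14, 12],                    column sums of the selected B rows
#                 [ 4,  6,  8]]
# mask.sum(1) -> [2, 2]                            group sizes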
If the input arrays are integer arrays, then you can save on memory and boost performance by treating the arrays as indices into an n-dimensional grid, which lets you create the 2D mask directly without going 3D, like so -
dims = np.maximum(A.max(0),D.max(0))+1
mask = np.ravel_multi_index(D.T,dims)[:,None] == np.ravel_multi_index(A.T,dims)
Sample run -
In [107]: A
Out[107]:
array([[1, 2, 3],
[1, 2, 3],
[4, 5, 6],
[4, 5, 6]])
In [108]: B
Out[108]:
array([[9, 8, 7],
[7, 6, 5],
[1, 2, 3],
[3, 4, 5]])
In [109]: mask = (D[:,None,:] == A).all(-1)
...: out = mask.dot(B)/(mask.sum(1)[:,None])
...:
In [110]: out
Out[110]:
array([[8, 7, 6],
[2, 3, 4]])
I see two hints:
First, comparing arrays by rows. A way to do that is to collapse your index system to 1D:
def indexer(M, base=256):
    return (M * base ** np.arange(3)).sum(axis=1)
base is an integer greater than A.max(). Then the selection can be done like this:
indices = np.equal.outer(indexer(D), indexer(A))
which gives:
array([[ True, True, False, False],
[False, False, True, True]], dtype=bool)
Second, each group can have a different length, so vectorization of the last step is difficult. Here is a way to achieve the job:
B = np.array(B)
means = [B[i].mean(axis=0) for i in indices]
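Putting the two steps together with the question's sample data (a hedged, self-contained sketch; base=256 assumes every value is below 256):
import numpy as np

A = np.array([[1,2,3],[1,2,3],[4,5,6],[4,5,6]])
B = np.array([[9,8,7],[7,6,5],[1,2,3],[3,4,5]])
D = np.array([[1,2,3],[4,5,6]])

def indexer(M, base=256):
    # encode each length-3 row as a single integer, digit by digit
    return (M * base ** np.arange(3)).sum(axis=1)

indices = np.equal.outer(indexer(D), indexer(A))
means = [B[i].mean(axis=0) for i in indices]
print(means)  # [array([8., 7., 6.]), array([2., 3., 4.])]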
I'm struggling to select the specific columns per row of a NumPy matrix.
Suppose I have the following matrix which I would call X:
[1, 2, 3]
[4, 5, 6]
[7, 8, 9]
I also have a list with one column index per row, which I would call Y:
[1, 0, 2]
I need to get the values:
[2]
[4]
[9]
Instead of a list with indexes Y, I can also produce a matrix with the same shape as X, where every entry is a bool / 0-1 int indicating whether this is the required column:
[0, 1, 0]
[1, 0, 0]
[0, 0, 1]
I know this can be done with iterating over the array and selecting the column values I need. However, this will be executed frequently on big arrays of data and that's why it has to run as fast as it can.
I was thus wondering if there is a better solution?
If you've got a boolean array you can do direct selection based on that like so:
>>> a = np.array([True, True, True, False, False])
>>> b = np.array([1,2,3,4,5])
>>> b[a]
array([1, 2, 3])
To go along with your initial example you could do the following:
>>> a = np.array([[1,2,3], [4,5,6], [7,8,9]])
>>> b = np.array([[False,True,False],[True,False,False],[False,False,True]])
>>> a[b]
array([2, 4, 9])
You can also add in an arange and do direct selection on that, though depending on how you're generating your boolean array and what your code looks like YMMV.
>>> a = np.array([[1,2,3], [4,5,6], [7,8,9]])
>>> a[np.arange(len(a)), [1,0,2]]
array([2, 4, 9])
You can do something like this:
In [7]: a = np.array([[1, 2, 3],
...: [4, 5, 6],
...: [7, 8, 9]])
In [8]: lst = [1, 0, 2]
In [9]: a[np.arange(len(a)), lst]
Out[9]: array([2, 4, 9])
More on indexing multi-dimensional arrays: http://docs.scipy.org/doc/numpy/user/basics.indexing.html#indexing-multi-dimensional-arrays
Recent numpy versions have added a take_along_axis (and put_along_axis) that does this indexing cleanly.
In [101]: a = np.arange(1,10).reshape(3,3)
In [102]: b = np.array([1,0,2])
In [103]: np.take_along_axis(a, b[:,None], axis=1)
Out[103]:
array([[2],
[4],
[9]])
It operates in the same way as:
In [104]: a[np.arange(3), b]
Out[104]: array([2, 4, 9])
but with different axis handling. It's especially aimed at applying the results of argsort and argmax.
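For instance (a small illustrative sketch), pairing take_along_axis with argmax to pull out each row's maximum:
In [105]: c = np.array([[3, 7, 1], [9, 2, 5]])

In [106]: j = np.argmax(c, axis=1)

In [107]: np.take_along_axis(c, j[:, None], axis=1)
Out[107]:
array([[7],
       [9]])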
A simple way might look like:
In [1]: a = np.array([[1, 2, 3],
...: [4, 5, 6],
...: [7, 8, 9]])
In [2]: y = [1, 0, 2] #list of indices we want to select from matrix 'a'
range(a.shape[0]) yields the row indices 0, 1, 2 (np.arange(a.shape[0]) works just as well):
In [3]: a[range(a.shape[0]), y] #we're selecting y indices from every row
Out[3]: array([2, 4, 9])
You can do it by using an iterator, like this:
np.fromiter((row[index] for row, index in zip(X, Y)), dtype=int)
Timing:
N = 1000
X = np.zeros(shape=(N, N))
Y = np.arange(N)
# @Aशwini चhaudhary
%timeit X[np.arange(len(X)), Y]
10000 loops, best of 3: 30.7 us per loop
#mine
%timeit np.fromiter((row[index] for row, index in zip(X, Y)), dtype=int)
1000 loops, best of 3: 1.15 ms per loop
#mine
%timeit np.diag(X.T[Y])
10 loops, best of 3: 20.8 ms per loop
Another clever way is to first transpose the array and index it thereafter. Finally, take the diagonal; it's always the right answer.
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
Y = np.array([1, 0, 2, 2])
np.diag(X.T[Y])
Step by step:
Original arrays:
>>> X
array([[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9],
[10, 11, 12]])
>>> Y
array([1, 0, 2, 2])
Transpose to make it possible to index it right.
>>> X.T
array([[ 1, 4, 7, 10],
[ 2, 5, 8, 11],
[ 3, 6, 9, 12]])
Get rows in the Y order.
>>> X.T[Y]
array([[ 2, 5, 8, 11],
[ 1, 4, 7, 10],
[ 3, 6, 9, 12],
[ 3, 6, 9, 12]])
The diagonal should now become clear.
>>> np.diag(X.T[Y])
array([ 2,  4,  9, 12])
The answer from hpaulj using take_along_axis should be the accepted one.
Here is a derived version with an N-dim index array:
>>> arr = np.arange(20).reshape((2,2,5))
>>> idx = np.array([[1,0],[2,4]])
>>> np.take_along_axis(arr, idx[...,None], axis=-1)
array([[[ 1],
[ 5]],
[[12],
[19]]])
Note that the selection operation is ignorant about the shapes. I used this to refine a possibly vector-valued argmax result from a histogram by fitting parabolas:
def interpol(arr):
    i = np.argmax(arr, axis=-1)
    a = lambda Δ: np.squeeze(np.take_along_axis(arr, i[...,None]+Δ, axis=-1), axis=-1)
    frac = .5*(a(1) - a(-1)) / (2*a(0) - a(-1) - a(1))  # |frac| < 0.5
    return i + frac
Note the squeeze, which removes the size-1 dimension so that i and frac end up with the same shape; they are the integer and fractional parts of the peak position.
I'm quite sure that it is possible to avoid the lambda, but would the interpolation formula still look nice?
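To that closing question: one lambda-free variant (a sketch under the same assumption, namely that the argmax never sits at the first or last index) gathers all three neighbors with a single take_along_axis call; whether the formula still looks nice is a matter of taste:
import numpy as np

def interpol_no_lambda(arr):
    i = np.argmax(arr, axis=-1)
    # gather arr[..., i-1], arr[..., i], arr[..., i+1] in one call
    nbrs = np.take_along_axis(arr, i[..., None] + np.arange(-1, 2), axis=-1)
    am1, a0, ap1 = np.moveaxis(nbrs, -1, 0)
    frac = 0.5 * (ap1 - am1) / (2 * a0 - am1 - ap1)   # |frac| < 0.5
    return i + frac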
I have the following array in NumPy:
A = array([1, 2, 3])
How can I obtain the following matrices (without an explicit loop)?
B = [ 1 1 1
2 2 2
3 3 3 ]
C = [ 1 2 3
1 2 3
1 2 3 ]
Thanks!
Edit2: The OP asks in the comments how to compute
n(i, j) = l(i, i) + l(j, j) - 2 * l(i, j)
I can think of two ways. I like this way because it generalizes easily:
import numpy as np
l=np.arange(9).reshape(3,3)
print(l)
# [[0 1 2]
# [3 4 5]
# [6 7 8]]
The idea is to use np.ogrid. This defines a list of two numpy arrays, one of shape (3,1) and one of shape (1,3):
grid=np.ogrid[0:3,0:3]
print(grid)
# [array([[0],
# [1],
# [2]]), array([[0, 1, 2]])]
grid[0] can be used as a proxy for the index i, and
grid[1] can be used as a proxy for the index j.
So everywhere in the expression l(i, i) + l(j, j) - 2 * l(i, j), you simply replace i-->grid[0], and j-->grid[1], and numpy broadcasting takes care of the rest:
n = l[grid[0], grid[0]] + l[grid[1], grid[1]] - 2*l
print(n)
# [[ 0  2  4]
#  [-2  0  2]
#  [-4 -2  0]]
However, in this particular case, since l(i,i) and l(j,j) are just the diagonal elements of l, you could do this instead:
d=np.diag(l)
print(d)
# [0 4 8]
d[np.newaxis,:] pumps up the shape of d to (1,3), and
d[:,np.newaxis] pumps up the shape of d to (3,1).
Numpy broadcasting pumps up d[np.newaxis,:] and d[:,np.newaxis] to shape (3,3), copying values as appropriate.
n = d[np.newaxis,:] + d[:,np.newaxis] - 2*l
print(n)
# [[ 0  2  4]
#  [-2  0  2]
#  [-4 -2  0]]
Edit1: Usually you do not need to form B or C. The purpose of Numpy broadcasting is to allow you to use A in place of B or C. If you show us how you plan to use B or C, we might be able to show you how to do the same with A and numpy broadcasting.
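For example, an outer sum of A with itself needs neither B nor C; broadcasting a column view of A against A performs the tiling implicitly (a minimal sketch):
import numpy as np

A = np.array([1, 2, 3])
# equivalent to B + C, but no 3x3 copies are ever materialized
print(A[:, np.newaxis] + A[np.newaxis, :])
# [[2 3 4]
#  [3 4 5]
#  [4 5 6]]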
(Original answer):
In [11]: B=A.repeat(3).reshape(3,3)
In [12]: B
Out[12]:
array([[1, 1, 1],
[2, 2, 2],
[3, 3, 3]])
In [13]: C=B.T
In [14]: C
Out[14]:
array([[1, 2, 3],
[1, 2, 3],
[1, 2, 3]])
or
In [25]: C=np.tile(A,(3,1))
In [26]: C
Out[26]:
array([[1, 2, 3],
[1, 2, 3],
[1, 2, 3]])
In [27]: B=C.T
In [28]: B
Out[28]:
array([[1, 1, 1],
[2, 2, 2],
[3, 3, 3]])
From the dirty tricks department (one caveat: the hard-coded stride of 4 in the original assumed 4-byte ints; A.itemsize is portable across int32/int64 platforms):
In [57]: np.lib.stride_tricks.as_strided(A,shape=(3,3),strides=(A.itemsize,0))
Out[57]:
array([[1, 1, 1],
[2, 2, 2],
[3, 3, 3]])
In [58]: np.lib.stride_tricks.as_strided(A,shape=(3,3),strides=(0,A.itemsize))
Out[58]:
array([[1, 2, 3],
[1, 2, 3],
[1, 2, 3]])
But note that these are views of A, not copies (as were the solutions above). Changing B alters A:
In [59]: B=np.lib.stride_tricks.as_strided(A,shape=(3,3),strides=(A.itemsize,0))
In [60]: B[0,0]=100
In [61]: A
Out[61]: array([100, 2, 3])
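As a footnote (my addition, not from the original answer): np.broadcast_to builds the same zero-copy views without hand-computed strides, and it marks the result read-only, so accidental writes like the one above raise an error instead of silently mutating A:
In [62]: np.broadcast_to(A[:, None], (3, 3))   # like B: each element repeated along a row
Out[62]:
array([[100, 100, 100],
       [  2,   2,   2],
       [  3,   3,   3]])

In [63]: np.broadcast_to(A, (3, 3))            # like C: A repeated as rows
Out[63]:
array([[100,   2,   3],
       [100,   2,   3],
       [100,   2,   3]])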
Very old thread but just in case someone cares...
C,B = np.meshgrid(A,A)