Use numpy setdiff1d keeping the order - python

a = np.array([1, 2, 3])
b = np.array([4, 2, 3, 1, 0])
c = np.setdiff1d(b, a)
print("c", c)
The result is c [0 4] but the answer I want is c [4 0].
How can I do that?

Get the mask of non-matches with np.in1d and simply boolean-index into b to retain the order of elements in it -
b[~np.in1d(b,a)]
Sample step-by-step run -
In [14]: a
Out[14]: array([1, 2, 3])
In [15]: b
Out[15]: array([4, 2, 3, 1, 0])
In [16]: ~np.in1d(b,a)
Out[16]: array([ True, False, False, False, True], dtype=bool)
In [17]: b[~np.in1d(b,a)]
Out[17]: array([4, 0])
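On newer NumPy versions (1.13+), np.isin is the recommended replacement for np.in1d; a minimal sketch of the same masking idea:
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 2, 3, 1, 0])

# invert=True builds the "not in a" mask directly
c = b[np.isin(b, a, invert=True)]
print(c)  # [4 0]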

If you want c to 1) have the elements of b that are not in a, and 2) keep them in the same order as they were in b, you can use a list comprehension:
c = np.array([el for el in b if el not in a])
setdiff1d treats the arrays as sets, and thus: 1) it does not respect the order of the elements, and 2) any element present more than once is treated as if it were present only once. For example, this code:
a = np.array([4])
b = np.array([2, 4, 2, 1, 1])
c = np.setdiff1d(b, a)
print(c)
will yield
[1 2]
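If you need to keep both the order and the duplicates, the mask-based approach from the first answer handles that case as well; a quick sketch:
import numpy as np

a = np.array([4])
b = np.array([2, 4, 2, 1, 1])

# boolean masking keeps order and repeated elements, unlike setdiff1d
c = b[~np.in1d(b, a)]
print(c)  # [2 2 1 1]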

How to use conditional statements or loops to achieve my requirement?

I am new to Python coding. Kindly help me to achieve my requirement.
Suppose there are two arrays 'a' and 'b' of size 3x4:
a = [[1,0,0,1],
     [0,0,1,1],
     [1,0,0,1]]
b = [[12,-34,-10,4],
     [2,11,-12,20],
     [-12,16,19,-9]]
Here, if b[i,j] < 10 then I want the corresponding a[i,j] to stay the same (i.e. it can be either 0 or 1); otherwise I want a[i,j] changed to 1.
Expected outcome for the above example:
c = [[1,0,0,1],
     [0,1,1,1],
     [1,1,1,1]]
You can use the or operator |:
In [11]: b >= 10
Out[11]:
array([[ True, False, False, False],
       [False,  True, False,  True],
       [False,  True,  True, False]])
In [12]: a | (b >= 10)
Out[12]:
array([[1, 0, 0, 1],
       [0, 1, 1, 1],
       [1, 1, 1, 1]])
The | is a bitwise or and is equivalent to np.bitwise_or:
In [13]: np.bitwise_or(a, b >= 10)
Out[13]:
array([[1, 0, 0, 1],
       [0, 1, 1, 1],
       [1, 1, 1, 1]])
This assumes both a and b are NumPy arrays; you can make this so with the array constructor:
a, b = np.array(a), np.array(b)
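If you prefer an explicit conditional over the bitwise trick, np.where expresses the same logic; a minimal sketch:
import numpy as np

a = np.array([[1, 0, 0, 1],
              [0, 0, 1, 1],
              [1, 0, 0, 1]])
b = np.array([[12, -34, -10, 4],
              [2, 11, -12, 20],
              [-12, 16, 19, -9]])

# where b >= 10 take 1, elsewhere keep the original a value
c = np.where(b >= 10, 1, a)
print(c)
# [[1 0 0 1]
#  [0 1 1 1]
#  [1 1 1 1]]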
If you do not want to use NumPy, you could use this nested list comprehension:
c = [[el_a | (el_b >= 10) for el_a, el_b in zip(row_a, row_b)]
     for row_a, row_b in zip(a, b)]
but I prefer Andy Hayden's answer. NumPy really shines for this kind of operation.

Permute within a row in python

I have two arrays that are paired, meaning that element i in one array belongs with element i in the other. I want to randomly swap elements between the two arrays at each index. I tried np.random.permutation, but that does not seem to give the right answer.
For example, if the two arrays are [1,2,3] and [4,5,6], one possible permutation would be [4,2,3] and [1,5,6].
You can stack your arrays and choose a random column for each row using choice.
Setup
a = np.array([1,2,3])
b = np.array([4,5,6])
v = np.column_stack((a,b))
# array([[1, 4],
#        [2, 5],
#        [3, 6]])
np.random.seed(1)
choices = np.random.choice(v.shape[1], v.shape[0])
# array([1, 1, 0])
Finally, to index:
v[np.arange(v.shape[0]), choices]
array([4, 5, 3])
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
random_arr = np.random.choice([0, 1], size=(len(a),))  # Generate a random array of 0s and 1s, let's say arr([0,0,1])
a1 = random_arr*a + (1-random_arr)*b  # arr([0,0,1])*arr([1,2,3]) + arr([1,1,0])*arr([4,5,6]) = arr([4, 5, 3])
b1 = random_arr*b + (1-random_arr)*a  # arr([0,0,1])*arr([4,5,6]) + arr([1,1,0])*arr([1,2,3]) = arr([1, 2, 6])
a = a1
b = b1
Run 1 of the code above:
a
Out[188]: array([4, 2, 6])
b
Out[189]: array([1, 5, 3])
Run 2:
a
Out[191]: array([4, 5, 3])
b
Out[192]: array([1, 2, 6])
You can use np.choose (here x and y play the role of the two paired arrays):
toss = np.random.randint(0, 2, len(x))  # random 0/1 per index
print(np.choose(toss, [x, y]))
print(np.choose(toss, [y, x]))
#[1 5 6]
#[4 2 3]
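The same per-index coin-flip idea can also be written with np.where; a compact sketch, assuming a and b are the paired arrays:
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# flip a coin per index and swap the pair wherever it comes up True
swap = np.random.rand(len(a)) < 0.5
a_new = np.where(swap, b, a)
b_new = np.where(swap, a, b)
print(a_new, b_new)  # e.g. [4 2 6] [1 5 3]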

Remove consecutive duplicates in a NumPy array

I would like to remove duplicates which follow each other, but not duplicates along the whole array. Also, I want to keep the ordering unchanged.
So if the input is [0 0 1 3 2 2 3 3] the output should be [0 1 3 2 3]
I found a way using itertools.groupby() but I am looking for a faster NumPy solution.
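For reference, the itertools.groupby baseline the question alludes to might look like this (a sketch):
import itertools

a = [0, 0, 1, 3, 2, 2, 3, 3]
# groupby collapses runs of equal values; keep one key per run
out = [k for k, _ in itertools.groupby(a)]
print(out)  # [0, 1, 3, 2, 3]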
a[np.insert(np.diff(a).astype(bool), 0, True)]
Out[99]: array([0, 1, 3, 2, 3])
The general idea is to use diff to find the difference between consecutive elements of the array, and then index only the elements where the difference is non-zero. Since diff is shorter than the original array by one element, we insert True at the beginning of the diff array before indexing, so the first element is always kept.
Explanation:
In [100]: a
Out[100]: array([0, 0, 1, 3, 2, 2, 3, 3])
In [101]: diff = np.diff(a).astype(bool)
In [102]: diff
Out[102]: array([False, True, True, True, False, True, False], dtype=bool)
In [103]: idx = np.insert(diff, 0, True)
In [104]: idx
Out[104]: array([ True, False, True, True, True, False, True, False], dtype=bool)
In [105]: a[idx]
Out[105]: array([0, 1, 3, 2, 3])
For pure Python, which also works with NumPy arrays, use this:
def modify(l):
    last = None
    for e in l:
        if e != last:
            yield e
            last = e

pure = list(modify([0, 0, 1, 3, 2, 2, 3, 3]))

import numpy
# numpy.array needs a concrete sequence; a bare generator would give a 0-d object array
num = numpy.array(list(modify(numpy.array([0, 0, 1, 3, 2, 2, 3, 3]))))
I don't know if there are any NumPy functions which would speed this up.
For NumPy version >= 1.16.0 you can use the prepend argument:
a[np.diff(a, prepend=np.nan).astype(bool)]
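A quick sanity check of that one-liner; prepending NaN makes the first difference NaN, which casts to True, so the first element is always kept:
import numpy as np

a = np.array([0, 0, 1, 3, 2, 2, 3, 3])
out = a[np.diff(a, prepend=np.nan).astype(bool)]
print(out)  # [0 1 3 2 3]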

How can I pad and/or truncate a vector to a specified length using numpy?

I have a couple of lists:
a = [1,2,3]
b = [1,2,3,4,5,6]
which are of variable length.
I want to return a vector of length five, such that if the input list length is < 5 then it will be padded with zeros on the right, and if it is > 5, then it will be truncated at the 5th element.
For example, input a would return np.array([1,2,3,0,0]), and input b would return np.array([1,2,3,4,5]).
I feel like I ought to be able to use np.pad, but I can't seem to follow the documentation.
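Since the question specifically asks about np.pad, here is a minimal sketch that uses it (the pad_or_truncate helper name is just illustrative): truncate first, then pad the remainder on the right.
import numpy as np

def pad_or_truncate(x, n=5):
    x = np.asarray(x)[:n]  # truncate to at most n elements
    # pad_width (0, k) adds k zeros on the right only
    return np.pad(x, (0, n - len(x)), mode='constant', constant_values=0)

print(pad_or_truncate([1, 2, 3]))            # [1 2 3 0 0]
print(pad_or_truncate([1, 2, 3, 4, 5, 6]))   # [1 2 3 4 5]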
This might be slow or fast, I am not sure, but it works for your purpose.
In [22]: pad = lambda a,i : a[0:i] if len(a) > i else a + [0] * (i-len(a))
In [23]: pad([1,2,3], 5)
Out[23]: [1, 2, 3, 0, 0]
In [24]: pad([1,2,3,4,5,6,7], 5)
Out[24]: [1, 2, 3, 4, 5]
np.pad is overkill; it is better suited to adding a border all around a 2D image than to appending a few zeros to a list.
I like the zip_longest approach, especially if the inputs are lists and don't need to be arrays. It's probably the closest you'll find to code that operates on all the lists at once in compiled code.
a, b = zip(*itertools.zip_longest(a, b, fillvalue=0))
is a version that does not use np.array at all (saving some array overhead).
But by itself it does not truncate; it still needs something like [x[:5] for x in (a, b)].
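Putting those two pieces together (assuming Python 3's itertools.zip_longest; the a5/b5 names are just for illustration), a complete sketch:
import itertools

a = [1, 2, 3]
b = [1, 2, 3, 4, 5, 6]

# pad every list to the longest length, then truncate each to 5
padded = zip(*itertools.zip_longest(a, b, fillvalue=0))
a5, b5 = [list(x)[:5] for x in padded]
print(a5)  # [1, 2, 3, 0, 0]
print(b5)  # [1, 2, 3, 4, 5]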
Here's my variation on the all_ms function, working with a simple list or 1d array:
def foo_1d(x, n=5):
    x = np.asarray(x)
    assert x.ndim == 1
    s = np.min([x.shape[0], n])
    ret = np.zeros((n,), dtype=x.dtype)
    ret[:s] = x[:s]
    return ret
In [772]: [foo_1d(x) for x in [[1,2,3], [1,2,3,4,5], np.arange(10)[::-1]]]
Out[772]: [array([1, 2, 3, 0, 0]), array([1, 2, 3, 4, 5]), array([9, 8, 7, 6, 5])]
One way or other the numpy solutions do the same thing - construct a blank array of the desired shape, and then fill it with the relevant values from the original.
One other detail: when truncating, the solution could, in theory, return a view instead of a copy, but that requires handling that case separately from the pad case.
If the desired output is a list of equal-length arrays, it may be worthwhile to collect them in a 2d array.
In [792]: def foo1(x, out):
   .....:     x = np.asarray(x)
   .....:     s = np.min((x.shape[0], out.shape[0]))
   .....:     out[:s] = x[:s]
In [794]: lists = [[1,2,3], [1,2,3,4,5], np.arange(10)[::-1], []]
In [795]: ret = np.zeros((len(lists), 5), int)
In [796]: for i,xx in enumerate(lists):
   .....:     foo1(xx, ret[i,:])
In [797]: ret
Out[797]:
array([[1, 2, 3, 0, 0],
       [1, 2, 3, 4, 5],
       [9, 8, 7, 6, 5],
       [0, 0, 0, 0, 0]])
Pure Python version, where a is a Python list (not a NumPy array): a[:n] + [0,]*(n - len(a)).
For example:
In [42]: n = 5
In [43]: a = [1, 2, 3]
In [44]: a[:n] + [0,]*(n - len(a))
Out[44]: [1, 2, 3, 0, 0]
In [45]: a = [1, 2, 3, 4]
In [46]: a[:n] + [0,]*(n - len(a))
Out[46]: [1, 2, 3, 4, 0]
In [47]: a = [1, 2, 3, 4, 5]
In [48]: a[:n] + [0,]*(n - len(a))
Out[48]: [1, 2, 3, 4, 5]
In [49]: a = [1, 2, 3, 4, 5, 6]
In [50]: a[:n] + [0,]*(n - len(a))
Out[50]: [1, 2, 3, 4, 5]
Function using numpy:
In [121]: def tosize(a, n):
.....: a = np.asarray(a)
.....: x = np.zeros(n, dtype=a.dtype)
.....: m = min(n, len(a))
.....: x[:m] = a[:m]
.....: return x
.....:
In [122]: tosize([1, 2, 3], 5)
Out[122]: array([1, 2, 3, 0, 0])
In [123]: tosize([1, 2, 3, 4], 5)
Out[123]: array([1, 2, 3, 4, 0])
In [124]: tosize([1, 2, 3, 4, 5], 5)
Out[124]: array([1, 2, 3, 4, 5])
In [125]: tosize([1, 2, 3, 4, 5, 6], 5)
Out[125]: array([1, 2, 3, 4, 5])

Matrix row difference, output a boolean vector

I have an m x 3 matrix A and its row subset B (n x 3). Both are sets of indices into another, large 4D matrix; their data type is dtype('int64'). I would like to generate a boolean vector x, where x[i] = True if B does not contain row A[i,:].
There are no duplicate rows in either A or B.
I was wondering if there's an efficient way to do this in NumPy? I found an answer that's somewhat related: https://stackoverflow.com/a/11903368/265289; however, it returns the actual rows (not a boolean vector).
You could follow the same pattern as shown in jterrace's answer, except use np.in1d instead of np.setdiff1d:
import numpy as np
np.random.seed(2015)
m, n = 10, 5
A = np.random.randint(10, size=(m,3))
B = A[np.random.choice(m, n, replace=False)]
print(A)
# [[2 2 9]
#  [6 8 5]
#  [7 8 0]
#  [6 7 8]
#  [3 8 6]
#  [9 2 3]
#  [1 2 6]
#  [2 9 8]
#  [5 8 4]
#  [8 9 1]]
print(B)
# [[2 2 9]
#  [1 2 6]
#  [2 9 8]
#  [3 8 6]
#  [9 2 3]]
def using_view(A, B, assume_unique=False):
    Ad = np.ascontiguousarray(A).view([('', A.dtype)] * A.shape[1])
    Bd = np.ascontiguousarray(B).view([('', B.dtype)] * B.shape[1])
    return ~np.in1d(Ad, Bd, assume_unique=assume_unique)
print(using_view(A, B, assume_unique=True))
yields
[False True True True False False False False True True]
You can use assume_unique=True (which can speed up the calculation) since
there are no duplicate rows in A or B.
Beware that A.view(...) will raise
ValueError: new type not compatible with array.
if A.flags['C_CONTIGUOUS'] is False (i.e. if A is not a C-contiguous array).
Therefore, in general we need to use np.ascontiguousarray(A) before calling view.
As B.M. suggests, you could instead view each row using the "void"
dtype:
def using_void(A, B):
    dtype = 'V{}'.format(A.dtype.itemsize * A.shape[-1])
    Ad = np.ascontiguousarray(A).view(dtype)
    Bd = np.ascontiguousarray(B).view(dtype)
    return ~np.in1d(Ad, Bd, assume_unique=True)
This is safe to use with integer dtypes. However, note that
In [342]: np.array([-0.], dtype='float64').view('V8') == np.array([0.], dtype='float64').view('V8')
Out[342]: array([False], dtype=bool)
so using np.in1d after viewing as void may return incorrect results for arrays
with float dtype.
Here is a benchmark of some of the proposed methods:
import numpy as np
np.random.seed(2015)
m, n = 10000, 5000
# Note A may contain duplicate rows,
# so don't use assume_unique=True for these benchmarks.
# In this case, using assume_unique=False does not improve the speed much anyway.
A = np.random.randint(10, size=(2*m,3))
# make A not C_CONTIGUOUS; the view methods fail for non-contiguous arrays
A = A[::2]
B = A[np.random.choice(m, n, replace=False)]
def using_view(A, B, assume_unique=False):
    Ad = np.ascontiguousarray(A).view([('', A.dtype)] * A.shape[1])
    Bd = np.ascontiguousarray(B).view([('', B.dtype)] * B.shape[1])
    return ~np.in1d(Ad, Bd, assume_unique=assume_unique)
from scipy.spatial import distance
def using_distance(A, B):
    return ~np.any(distance.cdist(A, B) == 0, 1)
from functools import reduce
def using_loop(A, B):
    pred = lambda i: A[:, i:i+1] == B[:, i]
    return ~reduce(np.logical_and, map(pred, range(A.shape[1]))).any(axis=1)
from pandas.core.groupby import get_group_index, _int64_overflow_possible
from functools import partial
def using_pandas(A, B):
    shape = [1 + max(A[:, i].max(), B[:, i].max()) for i in range(A.shape[1])]
    assert not _int64_overflow_possible(shape)
    encode = partial(get_group_index, shape=shape, sort=False, xnull=False)
    a1, b1 = map(encode, (A.T, B.T))
    return ~np.in1d(a1, b1)
def using_void(A, B):
    dtype = 'V{}'.format(A.dtype.itemsize * A.shape[-1])
    Ad = np.ascontiguousarray(A).view(dtype)
    Bd = np.ascontiguousarray(B).view(dtype)
    return ~np.in1d(Ad, Bd)
# Sanity check: make sure all the functions return the same result
for func in (using_distance, using_loop, using_pandas, using_void):
    assert (func(A, B) == using_view(A, B)).all()
In [384]: %timeit using_pandas(A, B)
100 loops, best of 3: 1.99 ms per loop
In [381]: %timeit using_void(A, B)
100 loops, best of 3: 6.72 ms per loop
In [378]: %timeit using_view(A, B)
10 loops, best of 3: 35.6 ms per loop
In [383]: %timeit using_loop(A, B)
1 loops, best of 3: 342 ms per loop
In [379]: %timeit using_distance(A, B)
1 loops, best of 3: 502 ms per loop
Since there are only 3 columns, one solution would be to just reduce across columns:
>>> a
array([[2, 2, 9],
       [6, 8, 5],
       [7, 8, 0],
       [6, 7, 8],
       [3, 8, 6],
       [9, 2, 3],
       [1, 2, 6],
       [2, 9, 8],
       [5, 8, 4],
       [8, 9, 1]])
>>> b
array([[2, 2, 9],
       [1, 2, 6],
       [2, 9, 8],
       [3, 8, 6],
       [9, 2, 3]])
>>> from functools import reduce
>>> pred = lambda i: a[:, i:i+1] == b[:,i]
>>> reduce(np.logical_and, map(pred, range(a.shape[1]))).any(axis=1)
array([ True, False, False, False, True, True, True, True, False, False], dtype=bool)
though this would create an m x n intermediate array which may not be memory efficient.
Alternatively, if the values are indices, i.e. non-negative integers, you may use pandas.groupby.get_group_index to reduce to one-dimensional arrays. This is an efficient algorithm which pandas uses internally for groupby operations; the only caveat is that you may need to verify that there will not be any integer overflow:
>>> from pandas.core.groupby import get_group_index, _int64_overflow_possible
>>> from functools import partial
>>> shape = [1 + max(a[:, i].max(), b[:, i].max()) for i in range(a.shape[1])]
>>> assert not _int64_overflow_possible(shape)
>>> encode = partial(get_group_index, shape=shape, sort=False, xnull=False)
>>> a1, b1 = map(encode, (a.T, b.T))
>>> np.in1d(a1, b1)
array([ True, False, False, False, True, True, True, True, False, False], dtype=bool)
You can treat A and B as two sets of XYZ points and calculate the euclidean distances between them with scipy.spatial.distance.cdist. The zero distances are the ones of interest. cdist is a pretty efficient implementation, so this should solve our case efficiently. The implementation to find such a boolean output would look like this -
from scipy.spatial import distance
out = ~np.any(distance.cdist(A,B)==0,1)
# OR np.all(distance.cdist(A,B)!=0,1)
Sample run -
In [582]: A
Out[582]:
array([[0, 2, 2],
       [1, 0, 3],
       [3, 3, 3],
       [2, 0, 3],
       [2, 0, 1],
       [1, 1, 1]])
In [583]: B
Out[583]:
array([[2, 0, 3],
       [2, 3, 3],
       [1, 1, 3],
       [2, 0, 1],
       [0, 2, 2],
       [2, 2, 2],
       [1, 2, 3]])
In [584]: out
Out[584]: array([False, True, True, False, False, True], dtype=bool)
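For moderately sized arrays, plain broadcasting is also a compact alternative; a sketch, with the caveat that it materializes an m x n x 3 comparison array:
import numpy as np

A = np.array([[2, 2, 9], [6, 8, 5], [7, 8, 0]])
B = np.array([[2, 2, 9], [6, 8, 5]])

# compare every row of A against every row of B, then reduce:
# all(-1) tests full-row equality, any(-1) asks "is this row anywhere in B?"
x = ~(A[:, None, :] == B[None, :, :]).all(-1).any(-1)
print(x)  # [False False  True]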
