Removing matching elements from two numpy arrays - python

Consider two sorted numpy arrays:
import numpy as np
a = np.array([1,2,4,4,6,8,10,10,21])
b = np.array([3,3,4,6,10,18,22])
How do I:
1. Find the elements that appear in both lists, and
2. Remove one instance of each such occurrence from each list.
That is, the output should be:
a = [1,2,4,8,10,21]
b = [3,3,18,22]
So even if there are duplicates, only one instance is removed. However, if the lists are
c = np.array([1,2,4,4,6,8,10,10,10,21])
d = np.array([3,3,4,6,10,10,18,22])
I expect to obtain the new outputs:
c = [1,2,4,8,10,21]
d = [3,3,18,22]
which is the same as above. The difference is the number of 10s in each list: each of the two 10s in d takes away one 10 from c, leaving the same result.
This post was the closest match to my question, but it removed all instances of repeats from both lists.

You can use collections.Counter:
from collections import Counter
import numpy as np
a = np.array([1, 2, 4, 4, 6, 8, 10, 10, 21])
b = np.array([3, 3, 4, 6, 10, 18, 22])
ca = Counter(a)
cb = Counter(b)
result_a = sorted((ca - cb).elements())
result_b = sorted((cb - ca).elements())
print(result_a)
print(result_b)
Output
[1, 2, 4, 8, 10, 21]
[3, 3, 18, 22]
It returns the same result (as expected) for:
a = np.array([1, 2, 4, 4, 6, 8, 10, 10, 10, 21])
b = np.array([3, 3, 4, 6, 10, 10, 18, 22])
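Counter subtraction is a multiset difference: counts that drop to zero or below are discarded, so each shared occurrence cancels exactly once, which is the behaviour the question asks for. If you want numpy arrays back, a minimal wrapper might look like this (the helper name remove_common_once is just for illustration):
from collections import Counter
import numpy as np
def remove_common_once(a, b):
    """Cancel each shared occurrence once and return sorted numpy arrays."""
    ca, cb = Counter(a.tolist()), Counter(b.tolist())
    return (np.array(sorted((ca - cb).elements())),
            np.array(sorted((cb - ca).elements())))
a = np.array([1, 2, 4, 4, 6, 8, 10, 10, 10, 21])   # the duplicated-10 example
b = np.array([3, 3, 4, 6, 10, 10, 18, 22])
print(*remove_common_once(a, b), sep="\n")
# [ 1  2  4  8 10 21]
# [ 3  3 18 22]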

You can find the indices of the first occurrences of the intersecting items using np.searchsorted as follows, and then remove them with the np.delete() function:
In [58]: intersect = a[np.in1d(a, b)]
In [59]: mask1 = np.searchsorted(a, intersect)
In [60]: mask2 = np.searchsorted(b, intersect)
In [61]: np.delete(a, mask1)
Out[61]: array([ 1, 2, 4, 8, 10, 21])
In [62]: np.delete(b, mask2)
Out[62]: array([ 3, 3, 18, 22])
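For context, np.searchsorted with the default side='left' returns, for each query value, the leftmost position at which it could be inserted while keeping the array sorted; for values that are already present, that is the index of their first occurrence. And since np.delete builds a keep-mask from the given indices, repeated entries in mask1/mask2 only remove that position once. A quick check with the question's array:
In [63]: np.searchsorted(a, [4, 6, 10])
Out[63]: array([2, 4, 6])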

I'm not 100% sure what you're looking to do based on the question, but I have been able to duplicate the output using the methods described.
import numpy as np
# List of b that are not in a
a = np.array([1,2,4,4,6,8,10,10,21])
b = np.array([3,3,4,6,10,18,22])
newb = [x for x in b if x not in a]
print(newb)
# REMOVE ONE DUPLICATED ELEMENT FROM LIST
import collections
counter = collections.Counter(a)
print(counter)
newa = list(a)
for k, v in counter.items():
    if v > 1:
        newa.remove(k)
print(newa)

If you don't mind the verbosity:
import numpy as np
a = np.array([1,2,4,4,6,8,10,10,21])
b = np.array([3,3,4,6,10,18,22])
common_values = set(a) & set(b)
a = a.tolist()
b = b.tolist()
for value in common_values:
    a.remove(value)
    b.remove(value)
a = np.array(a)
b = np.array(b)
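Note that list.remove drops only the first matching element, so one pass over the set of common values takes exactly one instance out of each list per shared value. Because common_values is a set, though, a value shared more than once (like the 10s in the c/d example from the question) is still removed only once from each side. A quick illustration of the remove behaviour:
a = [1, 2, 4, 4, 6, 8, 10, 10, 21]
a.remove(4)   # removes only the first 4
print(a)      # [1, 2, 4, 6, 8, 10, 10, 21]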

Using for loops:
import numpy as np
a = np.array([1,2,4,4,6,8,10,10,21])
b = np.array([3,3,4,6,10,18,22])
for i, val in enumerate(a):
    if val in b:
        a = np.delete(a, np.where(a == val)[0][0])
        b = np.delete(b, np.where(b == val)[0][0])
for i, val in enumerate(b):
    if val in a:
        a = np.delete(a, np.where(a == val)[0][0])
        b = np.delete(b, np.where(b == val)[0][0])
print(a)
print(b)
Outputs:
[ 1  2  4  8 10 21]
[ 3  3 18 22]

Here is a numpy approach:
import numpy as np
a = np.array([1,2,4,4,6,8,10,10,21])
b = np.array([3,3,4,6,10,18,22])
# join and sort (with Tim sort this should be O(n))
ab = np.concatenate([a,b])
i = ab.argsort(kind="stable")
abo = ab[i]
# mark 1st of each group of equal values
d = np.flatnonzero(np.diff(abo,prepend=abo[0]-1,append=abo[-1]+1))
# mark sorted total by origin (a -> False, b -> True)
ig = i>=len(a)
# compare origins of first and last of each group of equal values
# if they are different mark for deletion
dupl = ig[d[:-1]] ^ ig[d[1:]-1]
# finally, delete
ar = np.delete(a,i[d[:-1][dupl]])
br = np.delete(b,i[d[1:][dupl]-1]-len(a))
# inspect
ar
array([ 1, 2, 4, 8, 10, 21])
br
array([ 3, 3, 18, 22])
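The key property here is that the stable argsort keeps all of a's copies of a value ahead of b's copies within each run of equal values, so a group whose first and last elements have different origins must contain elements from both arrays. A small self-contained check of that ordering, using the question's data:
import numpy as np
a = np.array([1, 2, 4, 4, 6, 8, 10, 10, 21])
b = np.array([3, 3, 4, 6, 10, 18, 22])
ab = np.concatenate([a, b])
i = ab.argsort(kind="stable")
print(ab[i])                       # sorted union of a and b
print((i >= len(a)).astype(int))   # 0 = element came from a, 1 = from b
# within every run of equal values, the zeros come before the ones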

Related

How to efficiently shuffle some values of a numpy array while keeping their relative order?

I have a numpy array and a mask specifying which entries from that array to shuffle while keeping their relative order. Let's have an example:
In [2]: arr = np.array([5, 3, 9, 0, 4, 1])
In [4]: mask = np.array([True, False, False, False, True, True])
In [5]: arr[mask]
Out[5]: array([5, 4, 1]) # These entries shall be shuffled inside arr, while keeping their order.
In [6]: np.where(mask==True)
Out[6]: (array([0, 4, 5]),)
In [7]: shuffle_array(arr, mask) # I'm looking for an efficient realization of this function!
Out[7]: array([3, 5, 4, 9, 0, 1]) # See how the entries 5, 4 and 1 haven't changed their order.
I've written some code that can do this, but it's really slow.
import numpy as np
def shuffle_array(arr, mask):
    perm = np.arange(len(arr))  # permutation array
    n = mask.sum()
    if n > 0:
        old_true_pos = np.where(mask == True)[0]    # old positions for which mask is True
        old_false_pos = np.where(mask == False)[0]  # old positions for which mask is False
        new_true_pos = np.random.choice(perm, n, replace=False)  # draw new positions
        new_true_pos.sort()
        new_false_pos = np.setdiff1d(perm, new_true_pos)
        new_pos = np.hstack((new_true_pos, new_false_pos))
        old_pos = np.hstack((old_true_pos, old_false_pos))
        perm[new_pos] = perm[old_pos]
    return arr[perm]
To make things worse, I actually have two large matrices A and B with shape (M,N). Matrix A holds arbitrary values, while each row of matrix B is the mask which to use for shuffling one corresponding row of matrix A according to the procedure that I outlined above. So what I want is shuffled_matrix = row_wise_shuffle(A, B).
The only way I have so far found to do it is via my shuffle_array() function and a for loop.
Can you think of any numpy'onic way to accomplish this task avoiding loops? Thank you so much in advance!
For 1d case:
import numpy as np
a = np.arange(8)
b = np.array([1,1,1,1,0,0,0,0])
# Get ordered values
ordered_values = a[np.where(b==1)]
# We'll shuffle both arrays
shuffled_ix = np.random.permutation(a.shape[0])
a_shuffled = a[shuffled_ix]
b_shuffled = b[shuffled_ix]
# Replace the values with correct order
a_shuffled[np.where(b_shuffled==1)] = ordered_values
a_shuffled # Notice that 0, 1, 2, 3 preserves order.
>>>
array([0, 1, 2, 6, 3, 4, 7, 5])
for 2d case, columnwise shuffle (along axis=1):
import numpy as np
a = np.arange(24).reshape(4,6)
b = np.array([[0,0,0,0,1,1], [1,1,1,0,0,0], [1,1,1,1,0,0], [0,0,1,1,0,0]])
# The code below works for column shuffle (i.e. axis=1).
# Get ordered values
i,j = np.where(b==1)
values = a[i, j]
values
# We'll shuffle both arrays for axis=1
# taken from https://stackoverflow.com/questions/5040797/shuffling-numpy-array-along-a-given-axis
idx = np.random.rand(*a.shape).argsort(axis=1)
a_shuffled = np.take_along_axis(a,idx,axis=1)
b_shuffled = np.take_along_axis(b,idx,axis=1)
# Replace the values with correct order
a_shuffled[np.where(b_shuffled==1)] = values
# Get the result
a_shuffled # see that 4,5 | 6,7,8 | 12,13,14,15 | 20, 21 preserves order
>>>
array([[ 4, 1, 0, 3, 2, 5],
[ 9, 6, 7, 11, 8, 10],
[12, 13, 16, 17, 14, 15],
[23, 20, 19, 22, 21, 18]])
for 2d case, rowwise shuffle (along axis=0), we can use the same code, first transpose arrays and after shuffle transpose back:
import numpy as np
a = np.arange(24).reshape(4,6)
b = np.array([[0,0,0,0,1,1], [1,1,1,0,0,0], [1,1,1,1,0,0], [0,0,1,1,0,0]])
# The code below works for column shuffle (i.e. axis=1).
# As you said rowwise, we first transpose
at = a.T
bt = b.T
# Get ordered values
i,j = np.where(bt==1)
values = at[i, j]
values
# We'll shuffle both arrays for axis=1
# taken from https://stackoverflow.com/questions/5040797/shuffling-numpy-array-along-a-given-axis
idx = np.random.rand(*at.shape).argsort(axis=1)
at_shuffled = np.take_along_axis(at,idx,axis=1)
bt_shuffled = np.take_along_axis(bt,idx,axis=1)
# Replace the values with correct order
at_shuffled[np.where(bt_shuffled==1)] = values
# Get the result
a_shuffled = at_shuffled.T
a_shuffled # see that 6,12 | 7, 13 | 8,14,20 | 15, 21 preserves order
>>>
array([[ 6, 7, 2, 3, 10, 17],
[18, 19, 8, 15, 16, 23],
[12, 13, 14, 21, 4, 5],
[ 0, 1, 20, 9, 22, 11]])
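For the matrices A and B from the question, the axis=1 recipe above can be wrapped into the row_wise_shuffle helper the question mentions. This is only a sketch of that packaging, assuming each row of B masks the entries of the corresponding row of A:
import numpy as np
def row_wise_shuffle(A, B):
    # Shuffle each row of A, but re-impose the original left-to-right order
    # of the entries selected by the boolean mask B (same idea as the axis=1 code above).
    A = np.asarray(A)
    B = np.asarray(B).astype(bool)
    values = A[np.where(B)]                          # masked values in row-major order
    idx = np.random.rand(*A.shape).argsort(axis=1)   # independent permutation per row
    A_shuffled = np.take_along_axis(A, idx, axis=1)
    B_shuffled = np.take_along_axis(B, idx, axis=1)
    A_shuffled[np.where(B_shuffled)] = values        # restore the masked entries' order
    return A_shuffled
A = np.arange(24).reshape(4, 6)
B = np.array([[0,0,0,0,1,1], [1,1,1,0,0,0], [1,1,1,1,0,0], [0,0,1,1,0,0]])
print(row_wise_shuffle(A, B))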

Numpy array insert every second element from second array

I have two arrays and want to combine them so that element 0 and every odd-indexed element of the result come from the first array, and every even-indexed element (from index 2 onwards) comes from the second array, in the same order.
E.g.:
a = ([0,1,3,5])
b = ([2,4,6])
c = ([0,1,2,3,4,5,6])
I tried something involving modulo to identify the odd indices:
a = ([0,1,3,5])
b = ([2,4,6])
c = a
i = 0
j = 2
l = 0
for i in range(1,22):
    k = (i+j) % 2
    if k > 0:
        c = np.insert(c, i, b[l])
        l += 1
    else:
        continue
I guess there is some easier/faster slicing option, but can't figure it out.
np.insert would work well:
>>> A = np.array([1, 3, 5, 7])
>>> B = np.array([2, 4, 6, 8])
>>> np.insert(B, np.arange(len(A)), A)
array([1, 2, 3, 4, 5, 6, 7, 8])
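Applied to the arrays from the question, where a has one more element than b and the interleaving starts after a's first two entries, the insertion positions simply shift by one (a small adaptation, not part of the original answer):
>>> a = np.array([0, 1, 3, 5])
>>> b = np.array([2, 4, 6])
>>> np.insert(a, np.arange(2, 2 + len(b)), b)   # insert b[k] before a's original index k+2
array([0, 1, 2, 3, 4, 5, 6])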
However, if you don't rely on sorted values, try this:
>>> A = np.array([5, 3, 1])
>>> B = np.array([1, 2, 3])
>>> C = [ ]
>>> for element in zip(A, B):
...     C.extend(element)
>>> C
[5, 1, 3, 2, 1, 3]
Read the documentation of range:
for i in range(0, 10, 2):
    print(i)
will print 0, 2, 4, 6 and 8, one value per line.
From what I understand, the first element of a always comes first and the rest are just interleaved. If that is the case, then some clever use of stacking and reshaping is probably enough.
a = np.array([0,1,3,5])
b = np.array([2,4,6])
c = np.hstack([a[:1], np.vstack([a[1:], b]).T.reshape((-1, ))])
You could try something like this
import numpy as np
A = [0,1,3,5]
B = [2,4,6]
lst = np.zeros(len(A)+len(B))
lst[0]=A[0]
lst[1::2] = A[1:]
lst[2::2] = B
Even though I don't understand why you would make it so complicated.
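One small caveat: np.zeros defaults to float64, so lst comes back as floats; passing an integer dtype keeps the result integral (a minor tweak, assuming integer inputs):
import numpy as np
A = [0, 1, 3, 5]
B = [2, 4, 6]
lst = np.zeros(len(A) + len(B), dtype=int)   # dtype=int avoids a float result
lst[0] = A[0]
lst[1::2] = A[1:]
lst[2::2] = B
print(lst)   # [0 1 2 3 4 5 6]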

Splitting arrays depending on unique values in an array

I currently have two arrays, one of which has several repeated values and another with unique values.
Eg array 1 : a = [1, 1, 2, 2, 3, 3]
Eg array 2 : b = [10, 11, 12, 13, 14, 15]
I am developing code in Python that looks at the first array, identifies the elements that are the same, and remembers their indices. A new array is then created containing the elements of array b at those indices.
E.g.: as array 'a' has three unique values, at positions 1,2... 3,4... 5,6, three new arrays would be created, each containing the elements of array b at the corresponding positions. Thus, the result would be three new arrays:
b1 = [10, 11]
b2 = [12, 13]
b3 = [14, 15]
I have managed to develop some code; however, it only works when there are exactly three unique values in array 'a'. If there are more or fewer unique values, the code has to be modified by hand.
import itertools
import numpy as np
import matplotlib.tri as tri
import sys
a = [1, 1, 2, 2, 3, 3]
b = [10, 10, 20, 20, 30, 30]
b_1 = []
b_2 = []
b_3 = []
unique = []
for vals in a:
    if vals not in unique:
        unique.append(vals)
if len(unique) != 3:
    sys.exit("More than 3 'a' values - check dimension")
for j in range(0, len(a)):
    if a[j] == unique[0]:
        b_1.append(b[j])
    elif a[j] == unique[1]:
        b_2.append(b[j])
    elif a[j] == unique[2]:
        b_3.append(b[j])
    else:
        sys.exit("More than 3 'a' values - check dimension")
print (b_1)
print (b_2)
print (b_3)
I was wondering if there is perhaps a more elegant way to perform this task such that the code is able to cope with an n number of unique values.
Well, given that you are also using numpy, here's one way using np.unique. You can set return_index=True to get the index of the first occurrence of each unique value, and use those indices to split the array b with np.split:
a = np.array([1, 1, 2, 2, 3, 3])
b = np.array([10, 11, 12, 13, 14, 15])
u, s = np.unique(a, return_index=True)
np.split(b,s[1:])
Output
[array([10, 11]), array([12, 13]), array([14, 15])]
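This works because equal values in a are already grouped together. If the keys were not sorted, you could stably sort both arrays by a first; a small extension with made-up sample values:
import numpy as np
a = np.array([3, 1, 2, 1, 3, 2])       # unsorted grouping keys (illustrative values)
b = np.array([14, 10, 12, 11, 15, 13])
order = np.argsort(a, kind="stable")   # group equal keys together first
u, s = np.unique(a[order], return_index=True)
print(np.split(b[order], s[1:]))
# [array([10, 11]), array([12, 13]), array([14, 15])]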
You can use the function groupby():
from itertools import groupby
from operator import itemgetter
a = [1, 1, 2, 2, 3, 3]
b = [10, 11, 12, 13, 14, 15]
[[i[1] for i in g] for _, g in groupby(zip(a, b), key=itemgetter(0))]
# [[10, 11], [12, 13], [14, 15]]
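Keep in mind that groupby only merges consecutive equal keys, so for keys that are not already grouped you would sort the zipped pairs first (a quick illustration with made-up values):
from itertools import groupby
from operator import itemgetter
a = [2, 1, 2, 1]                # keys not grouped together (illustrative values)
b = [12, 10, 13, 11]
pairs = sorted(zip(a, b), key=itemgetter(0))   # groupby needs consecutive equal keys
print([[v for _, v in g] for _, g in groupby(pairs, key=itemgetter(0))])
# [[10, 11], [12, 13]]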

what is the most pythonic way to split a 2d array to arrays of each row?

I have a function foo that returns an array with shape (1000, 2).
How can I split it into two arrays a and b, each of shape (1000,)?
I'm looking for something like this:
a;b = foo()
I'm looking for an answer that can easily generalize to the case in which the shape is (1000, 5) or so.
The zip(*...) idiom transposes a traditional nested Python list:
x = [[1,2], [3,4], [5,6]]
# get columns
a, b = zip(*x) # zip(*foo())
# a, b = map(list, zip(*x)) # if you prefer lists over tuples
a
# (1, 3, 5)
# get rows
a, b, c = x
a
# [1, 2]
Transpose and unpack?
a, b = foo().T
>>> a, b = np.arange(20).reshape(-1, 2).T
>>> a
array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])
>>> b
array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19])
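The same unpacking generalizes to wider arrays, e.g. a (1000, 5) result from foo(); a quick check with a small stand-in array (the variable names are arbitrary):
>>> v, w, x, y, z = np.arange(20).reshape(-1, 5).T
>>> v
array([ 0,  5, 10, 15])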
You can use numpy.hsplit.
x = np.arange(12).reshape((3, 4))
np.hsplit(x, x.shape[1])
This returns a list of subarrays. Note that in the case of a 2d input, the subarrays will have shape (n, 1), unless you wrap a function around the call to squeeze them to 1d:
def split_1d(arr_2d):
    """Split 2d NumPy array on its columns."""
    split = np.hsplit(arr_2d, arr_2d.shape[1])
    split = [np.squeeze(arr) for arr in split]
    return split
a, b, c, d = split_1d(x)
a
# array([0, 4, 8])
d
# array([ 3, 7, 11])
You could just use list comprehensions, e.g.
(a,b)=([i[0] for i in mylist],[i[1] for i in mylist])
To generalise you could use a comprehension within a comprehension:
(a,b,c,d,e)=([row[i] for row in mylist] for i in range(5))
You can do this simply by using the zip function:
def foo(mylist):
    return zip(*mylist)
Now call foo with whatever dimensionality mylist has, and it will do the job:
mylist = [[1, 2], [3, 4], [5, 6]]
a, b = foo(mylist)
# a = (1, 3, 5)
# b = (2, 4, 6)
So this is a little nuts, but if you want to assign different letters to each sub-array in your array, and do so for any number of sub-arrays (up to 26 because alphabet), you could do:
import string
letters = list(string.ascii_lowercase) # get all of the lower-case letters
arr_dict = {k: v for k, v in zip(letters, foo())}
or more simply (for the last line):
arr_dict = dict(zip(letters, foo()))
Then you can access each individual element as arr_dict['a'] or arr_dict['b']. This feels a little mad-scientist-ey to me, but I thought it was fun.

Python equivalent of R "split"-function

In R, you could split a vector according to the factors of another vector:
> a <- 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> b <- rep(1:2,5)
[1] 1 2 1 2 1 2 1 2 1 2
> split(a,b)
$`1`
[1] 1 3 5 7 9
$`2`
[1] 2 4 6 8 10
Thus, grouping a list (in terms of python) according to the values of another list (according to the order of the factors).
Is there anything handy in python like that, except from the itertools.groupby approach?
From your example, it looks like each element in b gives the 1-indexed list in which the corresponding element will be stored. Python lacks the automatic numeric variables that R seems to have, so we'll return a tuple of lists. If you can do zero-indexed lists, and you only need two lists (i.e., for your R use case, 1 and 2 are the only values; in Python they'll be 0 and 1):
>>> a = range(1, 11)
>>> b = [0,1] * 5
>>> split(a, b)
([1, 3, 5, 7, 9], [2, 4, 6, 8, 10])
Then you can use itertools.compress:
import itertools
def split(x, f):
    return list(itertools.compress(x, f)), list(itertools.compress(x, (not i for i in f)))
If you need more general input (multiple numbers), something like the following will return an n-tuple:
def split(x, f):
    count = max(f) + 1
    return tuple(list(itertools.compress(x, (el == i for el in f))) for i in range(count))
>>> split([1,2,3,4,5,6,7,8,9,10], [0,1,1,0,2,3,4,0,1,2])
([1, 4, 8], [2, 3, 9], [5, 10], [6], [7])
Edit: warning, this is a groupby solution, which is not what the OP asked for, but it may be of use to someone looking for a less specific way to split the R way in Python.
Here's one way with itertools.
import itertools
# make your sample data
a = range(1,11)
b = list(zip(*zip(range(len(a)), itertools.cycle((1,2)))))[1]
{k: list(zip(*g))[1] for k, g in itertools.groupby(sorted(zip(b,a)), lambda x: x[0])}
# {1: (1, 3, 5, 7, 9), 2: (2, 4, 6, 8, 10)}
This gives you a dictionary, which is analogous to the named list that you get from R's split.
As a long time R user I was wondering how to do the same thing. It's a very handy function for tabulating vectors. This is what I came up with:
a = [1,2,3,4,5,6,7,8,9,10]
b = [1,2,1,2,1,2,1,2,1,2]
from collections import defaultdict
def split(x, f):
    res = defaultdict(list)
    for v, k in zip(x, f):
        res[k].append(v)
    return res
>>> split(a, b)
defaultdict(list, {1: [1, 3, 5, 7, 9], 2: [2, 4, 6, 8, 10]})
You could try:
a = [1,2,3,4,5,6,7,8,9,10]
b = [1,2,1,2,1,2,1,2,1,2]
split_1 = [a[k] for k in (i for i,j in enumerate(b) if j == 1)]
split_2 = [a[k] for k in (i for i,j in enumerate(b) if j == 2)]
results in:
In [22]: split_1
Out[22]: [1, 3, 5, 7, 9]
In [24]: split_2
Out[24]: [2, 4, 6, 8, 10]
To make this generalise you can simply iterate over the unique elements in b:
splits = {}
for index in set(b):
    splits[index] = [a[k] for k in (i for i,j in enumerate(b) if j == index)]
