choose rows from two matrices - python

I am trying to solve the following problem. I have two matrices A and B and I want to create a new matrix C which consists of the rows of the matrices A and B depending on some condition which is encoded in the array v, i.e. if the i'th entry of v is a one then I want the i'th row of C to be the i'th row of B and if it is a zero then it should be the i'th row of A. I came up with the following solution
C = np.choose(v,[A.T,B.T]).T
but it is too slow. One obvious drawback is the two transposes, but since np.choose does not take an axis argument I don't know how to get rid of them. Any ideas for a faster solution to this problem?
For example, let
A = np.arange(20).reshape([4,5])
and
B = 10 - A
Then one could imagine wanting C to consist, row by row, of whichever row of A and B has the larger sum. So we let
v = np.sum(A,axis=1)<np.sum(B,axis=1)
and then C is the matrix
C = np.choose(v,[A.T,B.T]).T
which is
array([[10, 9, 8, 7, 6],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]])

This seems like a good setup for np.where, which does the choosing based on the mask/binary input data -
C = np.where(v[:,None],B,A)
The v[:,None] part extends v to a column of shape (n,1), which broadcasts against A and B, so the choice is made row by row (along axis=0) for the two 2D arrays.
Sample run -
In [58]: A
Out[58]:
array([[82, 78, 57],
[14, 97, 32],
[72, 11, 49],
[98, 34, 41],
[89, 71, 52],
[34, 51, 55],
[26, 92, 59]])
In [59]: B
Out[59]:
array([[55, 67, 50],
[49, 64, 21],
[34, 18, 72],
[24, 61, 65],
[56, 59, 23],
[44, 77, 13],
[56, 55, 58]])
In [62]: v
Out[62]: array([1, 0, 0, 0, 0, 1, 1])
In [63]: np.where(v[:,None],B,A)
Out[63]:
array([[55, 67, 50],
[14, 97, 32],
[72, 11, 49],
[98, 34, 41],
[89, 71, 52],
[44, 77, 13],
[56, 55, 58]])
If v doesn't strictly consist of 0s and 1s only, use v[:,None]==1 as the first argument with np.where.
Another approach would be with boolean-indexing -
C = A.copy()
mask = v==1
C[mask] = B[mask]
Note : If v is already a boolean array, skip the comparison against 1 for the mask creation.
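As a quick check, here is a minimal sketch that runs both approaches on the small example from the question and compares them with the np.choose formulation. The names C_where, C_copy and C_choose are only illustrative.

```python
import numpy as np

# Example data from the question
A = np.arange(20).reshape(4, 5)
B = 10 - A
v = A.sum(axis=1) < B.sum(axis=1)   # boolean mask, one entry per row

# Approach 1: broadcast the mask as a column and pick element-wise
C_where = np.where(v[:, None], B, A)

# Approach 2: copy A, then overwrite the masked rows with rows of B
C_copy = A.copy()
C_copy[v] = B[v]

# Reference: the transposed np.choose formulation from the question
# (the mask is cast to integers so it can serve as a choice index)
C_choose = np.choose(v.astype(int), [A.T, B.T]).T

print(C_where)
```

All three arrays should be identical; the np.where version avoids both the transposes and the explicit copy.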
Runtime test -
In [77]: A = np.random.randint(11,99,(10000,3))
In [78]: B = np.random.randint(11,99,(10000,3))
In [79]: v = np.random.rand(A.shape[0])>0.5
In [82]: def choose_rows_copy(A, B, v):
    ...:     C = A.copy()
    ...:     C[v] = B[v]
    ...:     return C
    ...:
In [83]: %timeit np.where(v[:,None],B,A)
10000 loops, best of 3: 107 µs per loop
In [84]: %timeit choose_rows_copy(A, B, v)
1000 loops, best of 3: 226 µs per loop
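Outside IPython, a similar comparison can be sketched with the standard timeit module. This is only a rough harness; the helper name choose_rows_where and the repeat count are assumptions, and the absolute numbers will depend on the machine and NumPy version.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(11, 99, size=(10000, 3))
B = rng.integers(11, 99, size=(10000, 3))
v = rng.random(A.shape[0]) > 0.5

def choose_rows_where(A, B, v):
    # Broadcast the mask as a column and select element-wise
    return np.where(v[:, None], B, A)

def choose_rows_copy(A, B, v):
    # Copy A, then overwrite the masked rows with rows of B
    C = A.copy()
    C[v] = B[v]
    return C

# Both functions must agree before timing them
assert np.array_equal(choose_rows_where(A, B, v), choose_rows_copy(A, B, v))

n_calls = 200
for func in (choose_rows_where, choose_rows_copy):
    seconds = timeit.timeit(lambda: func(A, B, v), number=n_calls)
    print(f"{func.__name__}: {seconds / n_calls * 1e6:.1f} µs per call")
```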

Related

How to vectorize a numpy for loop that has a multiple indexed access

unigram is an array of shape (N, M, 100).
seq is a 1D array of size M, and M may be as large as 10000.
I would like to remove the for loop below and vectorize the computation.
batch_size, seq_len, num_labels = unigram_scores.shape
broadcast = np.broadcast_to(seq, (batch_size, seq_len))
for i in range(0, broadcast.shape[1]):
    n_seq[i] = unigram_scores[np.arange(batch_size), i , broadcast[:,i]]
Edit:
The answer by @hpaulj worked perfectly and also has the advantage of not requiring any extra dependency.
However, the speedup was much lower than I expected,
so I finally ended up installing numba:
import numpy as np
from numba import njit, prange
@njit(parallel=True)
def calculate_unigram_probability(unigram_scores,seq):
    batch_size, seq_len, num_labels = unigram_scores.shape
    broadcast = np.broadcast_to(seq, (batch_size, seq_len))
    for i in prange( broadcast.shape[1]):
        n_seq[i] = unigram_scores[np.arange(batch_size), i , broadcast[:,i]]
    return n_seq
which is also taking a bit too long. Currently I am trying to move it from the CPU to CUDA, which should bring about the speedup I am hoping for.
In [129]: N,M = 5,3
In [130]: unigram=np.arange(N*M*4).reshape(N,M,4)
In [131]: seq = np.arange(M)
In [132]: b_seq = np.broadcast_to(seq, (N,M))
For a single i:
In [133]: i=0; unigram[np.arange(N),i,b_seq[:,i]]
Out[133]: array([ 0, 12, 24, 36, 48])
For all i in the range:
In [136]: i=np.arange(M)[:,None]
In [137]: unigram[np.arange(N),i,b_seq[:,i]]
Out[137]:
array([[[ 0, 12, 24, 36, 48],
[ 5, 17, 29, 41, 53],
[10, 22, 34, 46, 58]],
...
[[ 0, 12, 24, 36, 48],
[ 5, 17, 29, 41, 53],
[10, 22, 34, 46, 58]]])
That is a (5,3,5) array. A (5,3) result might be better:
In [141]: i=np.arange(M); unigram[np.arange(N)[:,None],i,b_seq[:,i]]
Out[141]:
array([[ 0, 5, 10],
[12, 17, 22],
[24, 29, 34],
[36, 41, 46],
[48, 53, 58]])
We don't need to index b_seq: unigram[np.arange(N)[:,None],i,b_seq]
Or even skip b_seq entirely and let the indexing broadcast seq:
unigram[np.arange(N)[:,None],i,seq]
and with the help of ix_:
In [145]: I,J=np.ix_(np.arange(N), np.arange(M))
In [146]: unigram[I,J,seq]
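To see that the broadcast indexing really matches the loop from the question, here is a small self-contained sketch. The names looped and vectorized are illustrative, and the last dimension is 4 instead of 100 to keep the arrays small. Note that the loop fills an (M, N) array, while the vectorized expression returns the (N, M) transpose.

```python
import numpy as np

N, M, num_labels = 5, 3, 4
unigram = np.arange(N * M * num_labels).reshape(N, M, num_labels)
seq = np.arange(M)

# Loop version, as in the question: one column position i at a time
b_seq = np.broadcast_to(seq, (N, M))
looped = np.empty((M, N), dtype=unigram.dtype)
for i in range(M):
    looped[i] = unigram[np.arange(N), i, b_seq[:, i]]

# Vectorized version: index shapes (N,1), (M,), (M,) broadcast to (N, M)
vectorized = unigram[np.arange(N)[:, None], np.arange(M), seq]

print(vectorized)
print(np.array_equal(vectorized, looped.T))
```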
To get a visual idea of what this indexing does, look at unigram. It pulls 'diagonals' from successive blocks/batches:
In [147]: unigram
Out[147]:
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],
[[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23]],
...
You can use x.flatten() to reshape a 3D array to a 1D array (x must be a NumPy array).
In your case:
broadcast = broadcast.flatten()
This will transform an array of shape (N*M*1000) to an array of one dimension.

How to perform operations on certain rows of one np array based on conditions of another np array using numpy methods?

For example, I have one np array A = [[30, 60, 50...], [ 15, 20, 18...], [21, 81, 50...]...] of size (N, 10).
And I have another np array B = [1, 1, 0...] of size (N, ).
I want to perform operations on only some rows of A. E.g. I want the sum of each column in A, but only over the rows where B==1. How would I do that without using any loops and just NumPy methods?
So if I want the sum of the columns in A for indices where B == 1:
for the first column, result = 30 + 15, because the first two entries of B are 1 but the third is 0, so I wouldn't include the third row in the sum.
Use np.compress and sum along axis=0
>>> A = [[30, 60, 50], [ 15, 20, 18], [21, 81, 50]]
>>> B = [1, 1, 0]
>>> np.compress(B, A, axis=0).sum(0)
array([45, 80, 68])
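If A is already an array, index it with np.nonzero on B (B is converted to an array here as well, since the B.astype and B==0 variants further down need that):
>>> A = np.array([[30, 60, 50], [ 15, 20, 18], [21, 81, 50]])
>>> B = np.array(B)
>>> A[np.nonzero(B)].sum(0)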
array([45, 80, 68])
Another way:
>>> A[B.astype(bool)].sum(0)
array([45, 80, 68])
If you want 0s:
>>> np.compress(B==0, A, axis=0).sum(0)
# Or,
>>> A[np.nonzero(B==0)].sum(0)
# Or,
>>> A[~B.astype(bool)].sum(0)
If you want both 1s and 0s, obviously:
>>> A.sum(0)
You can convert B to bool type and mask A. Then you can get the sum along columns.
A = np.array([[30, 60, 50], [ 15, 20, 18], [21, 81, 50]])
B = np.array([1, 1, 0])
A[B.astype(bool)].sum(axis=0)
array([45, 80, 68])
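The variants above can be checked side by side. The sketch below also shows two alternatives that are not in the answers and are included only for illustration: a vector-matrix product, which works because B holds only 0s and 1s, and the where argument of np.sum.

```python
import numpy as np

A = np.array([[30, 60, 50], [15, 20, 18], [21, 81, 50]])
B = np.array([1, 1, 0])

mask = B == 1

# Boolean-mask the rows, then sum down each column
masked_sum = A[mask].sum(axis=0)

# Same result via np.compress, as in the first answer
compressed_sum = np.compress(mask, A, axis=0).sum(axis=0)

# Alternative (assumes B contains only 0s and 1s): weights times rows
weighted_sum = B @ A

# Alternative: let np.sum skip the unwanted rows without building a copy
where_sum = np.sum(A, axis=0, where=mask[:, None])

print(masked_sum)
```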

How to create an NumPy array based on the index stored in another array?

Let's say I have this NumPy array
A =
array([[0, 1, 3],
[1, 2, 4]])
I have another array
B =
array([[10, 41, 26, 50, 12, 24],
[20, 15, 42, 40, 41, 62]])
I want to create another array that selects, row by row, the elements of B at the column indices stored in A. That is
C =
array([[10, 41, 50],
[15, 42, 41]])
Try:
B[[[0],[1]], A]
Or more generally:
B[np.arange(A.shape[0])[:,None], A]
Output:
array([[10, 41, 50],
[15, 42, 41]])
You can use np.take_along_axis
np.take_along_axis(B, A, axis=1)
output:
array([[10, 41, 50],
[15, 42, 41]])
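Both suggestions can be verified against the expected C from the question with a short sketch (the variable names are illustrative):

```python
import numpy as np

A = np.array([[0, 1, 3],
              [1, 2, 4]])
B = np.array([[10, 41, 26, 50, 12, 24],
              [20, 15, 42, 40, 41, 62]])

# Fancy indexing: a column of row numbers broadcast against the column indices in A
rows = np.arange(A.shape[0])[:, None]
C_fancy = B[rows, A]

# take_along_axis: for each row of B, take the columns listed in the same row of A
C_take = np.take_along_axis(B, A, axis=1)

print(C_fancy)
print(np.array_equal(C_fancy, C_take))
```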
This can also be done with a plain Python loop and a list rather than NumPy indexing.
At the end we can convert the list into a NumPy array.
Code:
import numpy as np
# to keep it simple, take the 1D case
a = [0,1,3]
b = [10, 41, 26, 50, 12, 24]
c = []
a = np.array(a)
b = np.array(b)
# loop over the positions of a, look up the index stored there, and append that element of b to c
for i in range(len(a)):
    print(i)
    i = a[i]
    c.append(b[i])
print(c)
c = np.array(c)
print(type(c))
# To make it more fun, you can use the random module to get random digits
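For the 1D case, the loop can be replaced by indexing b directly with the index array. A minimal sketch, with c_loop and c as illustrative names:

```python
import numpy as np

a = np.array([0, 1, 3])
b = np.array([10, 41, 26, 50, 12, 24])

# Loop version: look up each index one at a time
c_loop = np.array([b[i] for i in a])

# Integer-array indexing does the same lookup in one step
c = b[a]

print(c)
```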

How to get triangle upper matrix without the diagonal using numpy

Let's say I have the following matrix:
A = np.array([
[1,2,3],
[4,5,6],
[7,8,9]])
How can I extract the upper triangle matrix without the diagonal efficiently?
The output would be the following array:
B = np.array([2,3,6])
One approach with masking -
def upper_tri_masking(A):
    m = A.shape[0]
    r = np.arange(m)
    mask = r[:,None] < r
    return A[mask]
Another with np.triu_indices -
def upper_tri_indexing(A):
    m = A.shape[0]
    r,c = np.triu_indices(m,1)
    return A[r,c]
Sample run -
In [403]: A
Out[403]:
array([[79, 17, 79, 58, 14],
[87, 63, 89, 26, 31],
[69, 34, 90, 24, 96],
[59, 60, 80, 52, 46],
[75, 80, 11, 61, 47]])
In [404]: upper_tri_masking(A)
Out[404]: array([17, 79, 58, 14, 89, 26, 31, 24, 96, 46])
Runtime test -
In [415]: A = np.random.randint(0,9,(5000,5000))
In [416]: %timeit upper_tri_masking(A)
10 loops, best of 3: 64.2 ms per loop
In [417]: %timeit upper_tri_indexing(A)
1 loop, best of 3: 252 ms per loop
Short answer
A[np.triu_indices_from(A, k=1)]
Long answer:
You can get the indices of the upper triangle in your matrix using:
indices = np.triu_indices_from(A)
indices
Out[1]:
(array([0, 0, 0, 1, 1, 2], dtype=int64),
array([0, 1, 2, 1, 2, 2], dtype=int64))
This will include the diagonal indices, to exclude them you can offset the diagonal by 1:
indices_with_offset = np.triu_indices_from(A, k=1)
indices_with_offset
Out[2]:
(array([0, 0, 1], dtype=int64),
array([1, 2, 2], dtype=int64))
Now use these index arrays to index into your matrix
A[indices_with_offset]
Out[3]:
array([2, 3, 6])
See docs here
np.triu(A, k=1)
indices = np.where(np.triu(np.ones(A.shape), k=1).astype(bool))
print(A[indices])
[2 3 6]
To summarize other answers. The shortest answer could be:
B = A[np.triu_indices_from(A,k=1)]
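As a sanity check, the approaches from the answers can be run on the 3x3 matrix from the question. This is a minimal sketch with illustrative variable names. Note that np.triu(A, k=1) on its own returns a 2D array with zeros below the offset diagonal rather than the flat list of elements.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Masking: True where the row index is strictly smaller than the column index
r = np.arange(A.shape[0])
by_mask = A[r[:, None] < r]

# Index arrays for the strict upper triangle (k=1 skips the diagonal)
by_indices = A[np.triu_indices(A.shape[0], k=1)]
by_indices_from = A[np.triu_indices_from(A, k=1)]

# For comparison: np.triu keeps the 2D shape and zero-fills the rest
zero_filled = np.triu(A, k=1)

print(by_mask)
print(zero_filled)
```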

Remove elements when satisfying certain condition

I have a large list:
a=[[4,34,1], [5,87,2], [2,76,9],...]
I want to compare all pairs of sub-lists, such that if
a[i][0]>a[j][0] and a[i][1]>a[j][1]
then the sub-list a[i] should be removed.
How could I achieve this goal in Python 2.7?
Here's a slightly more idiomatic way of implementing @MisterMiyagi's approach:
import itertools
drop = set()
for i, j in itertools.combinations(range(len(a)), 2):
    # I would've used ``enumerate`` here as well, but it is
    # easier to see the filtering criteria with explicit
    # indexing.
    if a[i][0] > a[j][0] and a[i][1] > a[j][1]:
        drop.add(i)
a = [value for idx, value in enumerate(a) if idx not in drop]
print(a)
How is it more idiomatic?
Combinatorial iterator from itertools instead of a double for loop.
No extra 0: in slices.
enumerate instead of explicit indexing to build the answer.
P.S. This is an O(N^2) solution, so it might take a while for large inputs.
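If NumPy is available, the same pairwise test can be written with broadcasting. This is only a sketch: drop_dominating is a hypothetical helper name, it checks every ordered pair (i, j) rather than only i < j, and it still needs O(N^2) time and memory, so it moves the loop into NumPy rather than changing the complexity.

```python
import numpy as np

def drop_dominating(a):
    """Remove every row whose first two entries are both strictly
    greater than those of some other row (hypothetical helper)."""
    if len(a) == 0:
        return []
    arr = np.asarray(a)
    x, y = arr[:, 0], arr[:, 1]
    # bigger[i, j] is True when row i exceeds row j in both coordinates
    bigger = (x[:, None] > x[None, :]) & (y[:, None] > y[None, :])
    # Drop row i if it exceeds at least one other row
    drop = bigger.any(axis=1)
    return arr[~drop].tolist()

print(drop_dominating([[4, 34, 1], [5, 87, 2], [2, 76, 9]]))
```

Rows are returned in their original order, so compare against other implementations after sorting.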
If you sort the list first (an O(n log n) operation), then you can identify
the items to keep (or reject) in one pass by comparing neighbors (an O(n)
operation). So for long lists this should be much faster than comparing all
pairs (an O(n**2) operation).
At the bottom of the post you'll find the code for using_sort:
In [22]: using_sort([[4,34,1], [5,87,2], [2,76,9]])
Out[22]: [[2, 76, 9], [4, 34, 1]]
In [23]: using_sort([[4, 34, 1], [5, 87, 2], [2, 76, 9], [4, 56, 12], [9, 34, 76]])
Out[23]: [[2, 76, 9], [4, 56, 12], [4, 34, 1], [9, 34, 76]]
We can compare that against a O(n**2) algorithm, using_product, based on Sergei Lebedev's answer.
First, let's check that they give the same result:
import numpy as np
tests = [
[[4, 34, 1], [5, 87, 2], [2, 76, 9], [4, 56, 12], [9, 34, 76]],
[[87, 26, 37], [50, 37, 23], [70, 97, 19], [86, 91, 55], [57, 55, 68],
[25, 35, 64], [82, 79, 66], [1, 30, 75], [16, 14, 71], [32, 89, 6]],
np.random.randint(100, size=(10, 3)).tolist(),
np.random.randint(100, size=(50, 3)).tolist(),
np.random.randint(100, size=(100, 3)).tolist()]
assert all([sorted(using_product(test)) == sorted(using_sort(test))
for test in tests])
Here is a benchmark showing using_sort is much faster than using_product.
Since using_sort is O(n log n) while using_product is O(n**2),
the speed advantage increases with the length of a.
In [17]: a = np.random.randint(100, size=(10**4, 3)).tolist()
In [20]: %timeit using_sort(a)
100 loops, best of 3: 9.44 ms per loop
In [21]: %timeit using_product(a)
1 loops, best of 3: 6.17 s per loop
I found visualizing the solution helpful. For each point in the result there is
a blue rectangular region emanating from it with the given point in the lower
left corner. This rectangular region depicts the set of points which can be
eliminated due to that point being in the result.
With using_sort, each time a point is found in the result, it keeps checking subsequent points in the sorted list against this point until it finds the next point in the result.
import itertools as IT
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
np.random.seed(2016)
def using_sort(a):
    if len(a) == 0: return []
    a = sorted(a, key=lambda x: (x[0], -x[1]))
    result = []
    pt = a[0]
    nextpt = pt
    for key, grp in IT.groupby(a, key=lambda x: x[0]):
        for item in grp:
            if not (item[0] > pt[0] and item[1] > pt[1]):
                result.append(item)
                nextpt = item
        pt = nextpt
    return result
def using_product(a):
    drop = set()
    for i, j in IT.product(range(len(a)), repeat=2):
        if (i != j
            and i not in drop
            and j not in drop
            and a[i][0] > a[j][0]
            and a[i][1] > a[j][1]):
            drop.add(i)
    a = [value for idx, value in enumerate(a) if idx not in drop]
    return a
def show(a, *args, **kwargs):
    a = sorted(a, key=lambda x: (x[0], -x[1]))
    points = np.array(a)[:, :2]
    ax = kwargs.pop('ax', plt.gca())
    xmax, ymax = kwargs.pop('rects', [None, None])
    ax.plot(points[:, 0], points[:, 1], *args, **kwargs)
    if xmax:
        for x, y in points:
            rect = mpatches.Rectangle((x, y), xmax-x, ymax-y, color="blue", alpha=0.1)
            ax.add_patch(rect)
tests = [
[[4, 34, 1], [5, 87, 2], [2, 76, 9], [4, 56, 12], [9, 34, 76]],
[[87, 26, 37], [50, 37, 23], [70, 97, 19], [86, 91, 55], [57, 55, 68],
[25, 35, 64], [82, 79, 66], [1, 30, 75], [16, 14, 71], [32, 89, 6]],
np.random.randint(100, size=(10, 3)).tolist(),
np.random.randint(100, size=(50, 3)).tolist(),
np.random.randint(100, size=(100, 3)).tolist()]
assert all([sorted(using_product(test)) == sorted(using_sort(test))
for test in tests])
for test in tests:
    print('test: {}'.format(test))
    show(test, 'o', label='test')
    for func, s in [('using_product', 20), ('using_sort', 10)]:
        result = locals()[func](test)
        print('{}: {}'.format(func, result))
        xmax, ymax = np.array(test)[:, :2].max(axis=0)
        show(result, 'o--', label=func, markersize=s, alpha=0.5, rects=[xmax, ymax])
    print('-'*80)
    plt.legend()
    plt.show()
Does this work?
a=[[4,94,1], [3,67,2], [2,76,9]]
b = a
c = []
for lista in a:
    condition = False
    for listb in b:
        if (lista[0] > listb[0] and lista[1] > listb[1]):
            condition = True
            break
    if not condition:
        c.append(lista)
c will then contain the list of lists you want.
EDIT: Changed boolean condition based on Sergei's comment.
