I'd like to fill in one matrix with copies of another one, like so:
for i in range(N):
for j in range(M):
matA[:,:,:,i,j] = matB
But I have many big dimensions, so I am looking for a faster way.
We could simply get a view into the input with np.broadcast_to to get the desired output -
matA = np.broadcast_to(matB[:,:,:,None,None], matB.shape + (N,M))
Being a view, its virtually free -
In [292]: matB = np.random.rand(20,20,20)
In [293]: N,M = 20,20
In [294]: %timeit np.broadcast_to(matB[:,:,:,None,None], matB.shape + (N,M))
100000 loops, best of 3: 4.02 µs per loop
If you need an output with its own memory space, create a copy with matA.copy().
Alternatively, we could use np.repeat -
np.repeat(matB[:,:,:,None],N*M,axis=-1).reshape(matB.shape+(N,M))
Related
Assume that I have two arrays A and B, where both A and B are m x n. My goal is now, for each row of A and B, to find where I should insert the elements of row i of A in the corresponding row of B. That is, I wish to apply np.digitize or np.searchsorted to each row of A and B.
My naive solution is to simply iterate over the rows. However, this is far too slow for my application. My question is therefore: is there a vectorized implementation of either algorithm that I haven't managed to find?
We can add each row some offset as compared to the previous row. We would use the same offset for both arrays. The idea is to use np.searchsorted on flattened version of input arrays thereafter and thus each row from b would be restricted to find sorted positions in the corresponding row in a. Additionally, to make it work for negative numbers too, we just need to offset for the minimum numbers as well.
So, we would have a vectorized implementation like so -
def searchsorted2d(a,b):
m,n = a.shape
max_num = np.maximum(a.max() - a.min(), b.max() - b.min()) + 1
r = max_num*np.arange(a.shape[0])[:,None]
p = np.searchsorted( (a+r).ravel(), (b+r).ravel() ).reshape(m,-1)
return p - n*(np.arange(m)[:,None])
Runtime test -
In [173]: def searchsorted2d_loopy(a,b):
...: out = np.zeros(a.shape,dtype=int)
...: for i in range(len(a)):
...: out[i] = np.searchsorted(a[i],b[i])
...: return out
...:
In [174]: # Setup input arrays
...: a = np.random.randint(11,99,(10000,20))
...: b = np.random.randint(11,99,(10000,20))
...: a = np.sort(a,1)
...: b = np.sort(b,1)
...:
In [175]: np.allclose(searchsorted2d(a,b),searchsorted2d_loopy(a,b))
Out[175]: True
In [176]: %timeit searchsorted2d_loopy(a,b)
10 loops, best of 3: 28.6 ms per loop
In [177]: %timeit searchsorted2d(a,b)
100 loops, best of 3: 13.7 ms per loop
The solution provided by #Divakar is ideal for integer data, but beware of precision issues for floating point values, especially if they span multiple orders of magnitude (e.g. [[1.0, 2,0, 3.0, 1.0e+20],...]). In some cases r may be so large that applying a+r and b+r wipes out the original values you're trying to run searchsorted on, and you're just comparing r to r.
To make the approach more robust for floating-point data, you could embed the row information into the arrays as part of the values (as a structured dtype), and run searchsorted on these structured dtypes instead.
def searchsorted_2d (a, v, side='left', sorter=None):
import numpy as np
# Make sure a and v are numpy arrays.
a = np.asarray(a)
v = np.asarray(v)
# Augment a with row id
ai = np.empty(a.shape,dtype=[('row',int),('value',a.dtype)])
ai['row'] = np.arange(a.shape[0]).reshape(-1,1)
ai['value'] = a
# Augment v with row id
vi = np.empty(v.shape,dtype=[('row',int),('value',v.dtype)])
vi['row'] = np.arange(v.shape[0]).reshape(-1,1)
vi['value'] = v
# Perform searchsorted on augmented array.
# The row information is embedded in the values, so only the equivalent rows
# between a and v are considered.
result = np.searchsorted(ai.flatten(),vi.flatten(), side=side, sorter=sorter)
# Restore the original shape, decode the searchsorted indices so they apply to the original data.
result = result.reshape(vi.shape) - vi['row']*a.shape[1]
return result
Edit: The timing on this approach is abysmal!
In [21]: %timeit searchsorted_2d(a,b)
10 loops, best of 3: 92.5 ms per loop
You would be better off just just using map over the array:
In [22]: %timeit np.array(list(map(np.searchsorted,a,b)))
100 loops, best of 3: 13.8 ms per loop
For integer data, #Divakar's approach is still the fastest:
In [23]: %timeit searchsorted2d(a,b)
100 loops, best of 3: 7.26 ms per loop
I have two ndarrays with shapes:
A = (32,512,640)
B = (4,512)
I need to multiply A and B such that I get a new ndarray:
C = (4,32,512,640)
Another way to think of it is that each row of vector B is multiplied along axis=-2 of A, which results in a new 1,32,512,640 cube. Each row of B can be looped over forming 1,32,512,640 cubes, which can then be used to build C up by using np.concatenate or np.vstack, such as:
# Sample inputs, where the dimensions aren't necessarily known
a = np.arange(32*512*465, dtype='f4').reshape((32,512,465))
b = np.ones((4,512), dtype='f4')
# Using a loop
d = []
for row in b:
d.append(np.expand_dims(row[None,:,None]*a, axis=0))
# Or using list comprehension
d = [np.expand_dims(row[None,:,None]*a,axis=0) for row in b]
# Stacking the final list
result = np.vstack(d)
But I am wondering if it's possible to use something like np.einsum or np.tensordot to get this vectorized all in one line. I'm still learning how to use those two methods, so I'm not sure if it's appropriate here.
Thanks!
We can leverage broadcasting after extending the dimensions of B with None/np.newaxis -
C = A * B[:,None,:,None]
With einsum, it would be -
C = np.einsum('ijk,lj->lijk',A,B)
There's no sum-reduction happening here, so einsum won't be any better than the explicit-broadcasting one. But since, we are looking for Pythonic solution, that could be used, once we get past its string notation.
Let's get some timings to finish things off -
In [15]: m,n,r,p = 32,512,640,4
...: A = np.random.rand(m,n,r)
...: B = np.random.rand(p,n)
In [16]: %timeit A * B[:,None,:,None]
10 loops, best of 3: 80.9 ms per loop
In [17]: %timeit np.einsum('ijk,lj->lijk',A,B)
10 loops, best of 3: 109 ms per loop
# Original soln
In [18]: %%timeit
...: d = []
...: for row in B:
...: d.append(np.expand_dims(row[None,:,None]*A, axis=0))
...:
...: result = np.vstack(d)
10 loops, best of 3: 130 ms per loop
Leverage multi-core
We could leverage multi-core capability of numexpr, which is suited for arithmetic operations and large data and thus gain some performance boost here. Let's time with it -
In [42]: import numexpr as ne
In [43]: B4D = B[:,None,:,None] # this is virtually free
In [44]: %timeit ne.evaluate('A*B4D')
10 loops, best of 3: 64.6 ms per loop
In one-line as : ne.evaluate('A*B4D',{'A':A,'B4D' :B[:,None,:,None]}).
Related post on how to control multi-core functionality.
I am using python 3.6 and numpy. I have a n dimensional array. I need to perform both partition and argpartition on the last dimension of the array. I can obviously call both functions, but it feels like wasting resources. Is there a way to get the results of both np.partition and np.argpartition at the same time? There should be a way to get the result of np.partition applying to the array the indices I get from np.argpartition but I don't see it at the moment!
Thank you!
Get those argpartition indices and then use advanced-indexing to get the partitioned array.
Thus, an implementation for a generic ndarray of any number of dimensions and along any generic axis, would be like so -
def partition_results(a, k, axis=-1):
idx = np.argpartition(a, k, axis=axis)
index_arr = list(np.ix_(*[range(i) for i in a.shape]))
index_arr[axis] = idx
return idx, a[index_arr]
np.ix_ gives us the "spread-out" range arrays to accomplish the task of advanced-indexing. These range arrays are needed to cover all dimensions corresponding to the lengths of the axes in argpartition indices array except the last one, for which we have those argpartition indices themselves. This setup is needed for such an indexing operation.
So, with the approach of using two separate calls to np.argpartition and np.partition, we would have it, like so -
def partition_results_exclusive_way(a, k):
idx = np.argpartition(a, k, axis=-1)
part_arr = np.partition(a, k, axis=-1)
return idx , part_arr
We will use it for comparison on performance and value verification in the next section.
Sample run and runtime test -
In [496]: a = np.random.rand(20,20,20,20,20)
In [502]: A0, B0 = partition_results_exclusive_way(a, 10)
In [503]: A1, B1 = partition_results(a, 10)
In [504]: np.allclose(A0,A1)
Out[504]: True
In [505]: np.allclose(B0,B1)
Out[505]: True
In [506]: %timeit partition_results_exclusive_way(a, 10)
10 loops, best of 3: 92.6 ms per loop
In [507]: %timeit partition_results(a, 10)
10 loops, best of 3: 76 ms per loop
Dissecting a bit more on the performance numbers, let's time argpartition and partition separately -
In [509]: %timeit np.argpartition(a, 10, axis=-1)
10 loops, best of 3: 49.6 ms per loop
In [510]: %timeit np.partition(a, 10, axis=-1)
10 loops, best of 3: 43.6 ms per loop
So, the advanced-indexing operation costed us around half of what we had with np.partition. We are definitely saving there!
I have a csr_matrix A of shape (70000, 80000) and another csr_matrix Bof shape (1, 80000). How can I efficiently add B to every row of A? One idea is to somehow create a sparse matrix B' which is rows of B repeated, but numpy.repeat does not work and using a matrix of ones to create B' is very memory inefficient.
I also tried iterating through every row of A and adding B to it, but that again is very time inefficient.
Update:
I tried something very simple which seems to be very efficient than the ideas I mentioned above. The idea is to use scipy.sparse.vstack:
C = sparse.vstack([B for x in range(A.shape[0])])
A + C
This performs well for my task! Few more realizations: I initially tried an iterative approach where I called vstackmultiple times, this approach is slower than calling it just once.
A + B[np.zeros(A.shape[0])] is another way to expand B to the same shape as A.
It has about the same performance and memory footprint as Warren Weckesser's solution:
import numpy as np
import scipy.sparse as sparse
N, M = 70000, 80000
A = sparse.rand(N, M, density=0.001).tocsr()
B = sparse.rand(1, M, density=0.001).tocsr()
In [185]: %timeit u = sparse.csr_matrix(np.ones((A.shape[0], 1), dtype=B.dtype)); Bp = u * B; A + Bp
1 loops, best of 3: 284 ms per loop
In [186]: %timeit A + B[np.zeros(A.shape[0])]
1 loops, best of 3: 280 ms per loop
and appears to be faster than using sparse.vstack:
In [187]: %timeit A + sparse.vstack([B for x in range(A.shape[0])])
1 loops, best of 3: 606 ms per loop
I have two 3d arrays A and B with shape (N, 2, 2) that I would like to multiply element-wise according to the N-axis with a matrix product on each of the 2x2 matrix. With a loop implementation, it looks like
C[i] = dot(A[i], B[i])
Is there a way I could do this without using a loop? I've looked into tensordot, but haven't been able to get it to work. I think I might want something like tensordot(a, b, axes=([1,2], [2,1])) but that's giving me an NxN matrix.
It seems you are doing matrix-multiplications for each slice along the first axis. For the same, you can use np.einsum like so -
np.einsum('ijk,ikl->ijl',A,B)
We can also use np.matmul -
np.matmul(A,B)
On Python 3.x, this matmul operation simplifies with # operator -
A # B
Benchmarking
Approaches -
def einsum_based(A,B):
return np.einsum('ijk,ikl->ijl',A,B)
def matmul_based(A,B):
return np.matmul(A,B)
def forloop(A,B):
N = A.shape[0]
C = np.zeros((N,2,2))
for i in range(N):
C[i] = np.dot(A[i], B[i])
return C
Timings -
In [44]: N = 10000
...: A = np.random.rand(N,2,2)
...: B = np.random.rand(N,2,2)
In [45]: %timeit einsum_based(A,B)
...: %timeit matmul_based(A,B)
...: %timeit forloop(A,B)
100 loops, best of 3: 3.08 ms per loop
100 loops, best of 3: 3.04 ms per loop
100 loops, best of 3: 10.9 ms per loop
You just need to perform the operation on the first dimension of your tensors, which is labeled by 0:
c = tensordot(a, b, axes=(0,0))
This will work as you wish. Also you don't need a list of axes, because it's just along one dimension you're performing the operation. With axes([1,2],[2,1]) you're cross multiplying the 2nd and 3rd dimensions. If you write it in index notation (Einstein summing convention) this corresponds to c[i,j] = a[i,k,l]*b[j,k,l], thus you're contracting the indices you want to keep.
EDIT: Ok, the problem is that the tensor product of a two 3d object is a 6d object. Since contractions involve pairs of indices, there's no way you'll get a 3d object by a tensordot operation. The trick is to split your calculation in two: first you do the tensordot on the index to do the matrix operation and then you take a tensor diagonal in order to reduce your 4d object to 3d. In one command:
d = np.diagonal(np.tensordot(a,b,axes=()), axis1=0, axis2=2)
In tensor notation d[i,j,k] = c[i,j,i,k] = a[i,j,l]*b[i,l,k].