Add scipy sparse row matrix to another sparse matrix - python

I have a csr_matrix A of shape (70000, 80000) and another csr_matrix Bof shape (1, 80000). How can I efficiently add B to every row of A? One idea is to somehow create a sparse matrix B' which is rows of B repeated, but numpy.repeat does not work and using a matrix of ones to create B' is very memory inefficient.
I also tried iterating through every row of A and adding B to it, but that again is very time inefficient.
Update:
I tried something very simple which seems to be very efficient than the ideas I mentioned above. The idea is to use scipy.sparse.vstack:
C = sparse.vstack([B for x in range(A.shape[0])])
A + C
This performs well for my task! Few more realizations: I initially tried an iterative approach where I called vstackmultiple times, this approach is slower than calling it just once.

A + B[np.zeros(A.shape[0])] is another way to expand B to the same shape as A.
It has about the same performance and memory footprint as Warren Weckesser's solution:
import numpy as np
import scipy.sparse as sparse
N, M = 70000, 80000
A = sparse.rand(N, M, density=0.001).tocsr()
B = sparse.rand(1, M, density=0.001).tocsr()
In [185]: %timeit u = sparse.csr_matrix(np.ones((A.shape[0], 1), dtype=B.dtype)); Bp = u * B; A + Bp
1 loops, best of 3: 284 ms per loop
In [186]: %timeit A + B[np.zeros(A.shape[0])]
1 loops, best of 3: 280 ms per loop
and appears to be faster than using sparse.vstack:
In [187]: %timeit A + sparse.vstack([B for x in range(A.shape[0])])
1 loops, best of 3: 606 ms per loop

Related

Numpy searchsorted along many dimensions? [duplicate]

Assume that I have two arrays A and B, where both A and B are m x n. My goal is now, for each row of A and B, to find where I should insert the elements of row i of A in the corresponding row of B. That is, I wish to apply np.digitize or np.searchsorted to each row of A and B.
My naive solution is to simply iterate over the rows. However, this is far too slow for my application. My question is therefore: is there a vectorized implementation of either algorithm that I haven't managed to find?
We can add each row some offset as compared to the previous row. We would use the same offset for both arrays. The idea is to use np.searchsorted on flattened version of input arrays thereafter and thus each row from b would be restricted to find sorted positions in the corresponding row in a. Additionally, to make it work for negative numbers too, we just need to offset for the minimum numbers as well.
So, we would have a vectorized implementation like so -
def searchsorted2d(a,b):
m,n = a.shape
max_num = np.maximum(a.max() - a.min(), b.max() - b.min()) + 1
r = max_num*np.arange(a.shape[0])[:,None]
p = np.searchsorted( (a+r).ravel(), (b+r).ravel() ).reshape(m,-1)
return p - n*(np.arange(m)[:,None])
Runtime test -
In [173]: def searchsorted2d_loopy(a,b):
...: out = np.zeros(a.shape,dtype=int)
...: for i in range(len(a)):
...: out[i] = np.searchsorted(a[i],b[i])
...: return out
...:
In [174]: # Setup input arrays
...: a = np.random.randint(11,99,(10000,20))
...: b = np.random.randint(11,99,(10000,20))
...: a = np.sort(a,1)
...: b = np.sort(b,1)
...:
In [175]: np.allclose(searchsorted2d(a,b),searchsorted2d_loopy(a,b))
Out[175]: True
In [176]: %timeit searchsorted2d_loopy(a,b)
10 loops, best of 3: 28.6 ms per loop
In [177]: %timeit searchsorted2d(a,b)
100 loops, best of 3: 13.7 ms per loop
The solution provided by #Divakar is ideal for integer data, but beware of precision issues for floating point values, especially if they span multiple orders of magnitude (e.g. [[1.0, 2,0, 3.0, 1.0e+20],...]). In some cases r may be so large that applying a+r and b+r wipes out the original values you're trying to run searchsorted on, and you're just comparing r to r.
To make the approach more robust for floating-point data, you could embed the row information into the arrays as part of the values (as a structured dtype), and run searchsorted on these structured dtypes instead.
def searchsorted_2d (a, v, side='left', sorter=None):
import numpy as np
# Make sure a and v are numpy arrays.
a = np.asarray(a)
v = np.asarray(v)
# Augment a with row id
ai = np.empty(a.shape,dtype=[('row',int),('value',a.dtype)])
ai['row'] = np.arange(a.shape[0]).reshape(-1,1)
ai['value'] = a
# Augment v with row id
vi = np.empty(v.shape,dtype=[('row',int),('value',v.dtype)])
vi['row'] = np.arange(v.shape[0]).reshape(-1,1)
vi['value'] = v
# Perform searchsorted on augmented array.
# The row information is embedded in the values, so only the equivalent rows
# between a and v are considered.
result = np.searchsorted(ai.flatten(),vi.flatten(), side=side, sorter=sorter)
# Restore the original shape, decode the searchsorted indices so they apply to the original data.
result = result.reshape(vi.shape) - vi['row']*a.shape[1]
return result
Edit: The timing on this approach is abysmal!
In [21]: %timeit searchsorted_2d(a,b)
10 loops, best of 3: 92.5 ms per loop
You would be better off just just using map over the array:
In [22]: %timeit np.array(list(map(np.searchsorted,a,b)))
100 loops, best of 3: 13.8 ms per loop
For integer data, #Divakar's approach is still the fastest:
In [23]: %timeit searchsorted2d(a,b)
100 loops, best of 3: 7.26 ms per loop

Numpy: Fill in a Matrix With a Smaller One FAST

I'd like to fill in one matrix with copies of another one, like so:
for i in range(N):
for j in range(M):
matA[:,:,:,i,j] = matB
But I have many big dimensions, so I am looking for a faster way.
We could simply get a view into the input with np.broadcast_to to get the desired output -
matA = np.broadcast_to(matB[:,:,:,None,None], matB.shape + (N,M))
Being a view, its virtually free -
In [292]: matB = np.random.rand(20,20,20)
In [293]: N,M = 20,20
In [294]: %timeit np.broadcast_to(matB[:,:,:,None,None], matB.shape + (N,M))
100000 loops, best of 3: 4.02 µs per loop
If you need an output with its own memory space, create a copy with matA.copy().
Alternatively, we could use np.repeat -
np.repeat(matB[:,:,:,None],N*M,axis=-1).reshape(matB.shape+(N,M))

Most Pythonic way to multiply these two vectors?

I have two ndarrays with shapes:
A = (32,512,640)
B = (4,512)
I need to multiply A and B such that I get a new ndarray:
C = (4,32,512,640)
Another way to think of it is that each row of vector B is multiplied along axis=-2 of A, which results in a new 1,32,512,640 cube. Each row of B can be looped over forming 1,32,512,640 cubes, which can then be used to build C up by using np.concatenate or np.vstack, such as:
# Sample inputs, where the dimensions aren't necessarily known
a = np.arange(32*512*465, dtype='f4').reshape((32,512,465))
b = np.ones((4,512), dtype='f4')
# Using a loop
d = []
for row in b:
d.append(np.expand_dims(row[None,:,None]*a, axis=0))
# Or using list comprehension
d = [np.expand_dims(row[None,:,None]*a,axis=0) for row in b]
# Stacking the final list
result = np.vstack(d)
But I am wondering if it's possible to use something like np.einsum or np.tensordot to get this vectorized all in one line. I'm still learning how to use those two methods, so I'm not sure if it's appropriate here.
Thanks!
We can leverage broadcasting after extending the dimensions of B with None/np.newaxis -
C = A * B[:,None,:,None]
With einsum, it would be -
C = np.einsum('ijk,lj->lijk',A,B)
There's no sum-reduction happening here, so einsum won't be any better than the explicit-broadcasting one. But since, we are looking for Pythonic solution, that could be used, once we get past its string notation.
Let's get some timings to finish things off -
In [15]: m,n,r,p = 32,512,640,4
...: A = np.random.rand(m,n,r)
...: B = np.random.rand(p,n)
In [16]: %timeit A * B[:,None,:,None]
10 loops, best of 3: 80.9 ms per loop
In [17]: %timeit np.einsum('ijk,lj->lijk',A,B)
10 loops, best of 3: 109 ms per loop
# Original soln
In [18]: %%timeit
...: d = []
...: for row in B:
...: d.append(np.expand_dims(row[None,:,None]*A, axis=0))
...:
...: result = np.vstack(d)
10 loops, best of 3: 130 ms per loop
Leverage multi-core
We could leverage multi-core capability of numexpr, which is suited for arithmetic operations and large data and thus gain some performance boost here. Let's time with it -
In [42]: import numexpr as ne
In [43]: B4D = B[:,None,:,None] # this is virtually free
In [44]: %timeit ne.evaluate('A*B4D')
10 loops, best of 3: 64.6 ms per loop
In one-line as : ne.evaluate('A*B4D',{'A':A,'B4D' :B[:,None,:,None]}).
Related post on how to control multi-core functionality.

Numpy elementwise product of 3d array

I have two 3d arrays A and B with shape (N, 2, 2) that I would like to multiply element-wise according to the N-axis with a matrix product on each of the 2x2 matrix. With a loop implementation, it looks like
C[i] = dot(A[i], B[i])
Is there a way I could do this without using a loop? I've looked into tensordot, but haven't been able to get it to work. I think I might want something like tensordot(a, b, axes=([1,2], [2,1])) but that's giving me an NxN matrix.
It seems you are doing matrix-multiplications for each slice along the first axis. For the same, you can use np.einsum like so -
np.einsum('ijk,ikl->ijl',A,B)
We can also use np.matmul -
np.matmul(A,B)
On Python 3.x, this matmul operation simplifies with # operator -
A # B
Benchmarking
Approaches -
def einsum_based(A,B):
return np.einsum('ijk,ikl->ijl',A,B)
def matmul_based(A,B):
return np.matmul(A,B)
def forloop(A,B):
N = A.shape[0]
C = np.zeros((N,2,2))
for i in range(N):
C[i] = np.dot(A[i], B[i])
return C
Timings -
In [44]: N = 10000
...: A = np.random.rand(N,2,2)
...: B = np.random.rand(N,2,2)
In [45]: %timeit einsum_based(A,B)
...: %timeit matmul_based(A,B)
...: %timeit forloop(A,B)
100 loops, best of 3: 3.08 ms per loop
100 loops, best of 3: 3.04 ms per loop
100 loops, best of 3: 10.9 ms per loop
You just need to perform the operation on the first dimension of your tensors, which is labeled by 0:
c = tensordot(a, b, axes=(0,0))
This will work as you wish. Also you don't need a list of axes, because it's just along one dimension you're performing the operation. With axes([1,2],[2,1]) you're cross multiplying the 2nd and 3rd dimensions. If you write it in index notation (Einstein summing convention) this corresponds to c[i,j] = a[i,k,l]*b[j,k,l], thus you're contracting the indices you want to keep.
EDIT: Ok, the problem is that the tensor product of a two 3d object is a 6d object. Since contractions involve pairs of indices, there's no way you'll get a 3d object by a tensordot operation. The trick is to split your calculation in two: first you do the tensordot on the index to do the matrix operation and then you take a tensor diagonal in order to reduce your 4d object to 3d. In one command:
d = np.diagonal(np.tensordot(a,b,axes=()), axis1=0, axis2=2)
In tensor notation d[i,j,k] = c[i,j,i,k] = a[i,j,l]*b[i,l,k].

Fastest way to populate a matrix with a function on pairs of elements in two numpy vectors?

I have two 1 dimensional numpy vectors va and vb which are being used to populate a matrix by passing all pair combinations to a function.
na = len(va)
nb = len(vb)
D = np.zeros((na, nb))
for i in range(na):
for j in range(nb):
D[i, j] = foo(va[i], vb[j])
As it stands, this piece of code takes a very long time to run due to the fact that va and vb are relatively large (4626 and 737). However I am hoping this can be improved due to the fact that a similiar procedure is performed using the cdist method from scipy with very good performance.
D = cdist(va, vb, metric)
I am obviously aware that scipy has the benefit of running this piece of code in C rather than in python - but I'm hoping there is some numpy function im unaware of that can execute this quickly.
One of the least known numpy functions for what the docs call functional programming routines is np.frompyfunc. This creates a numpy ufunc from a Python function. Not some other object that closely simulates a numpy ufunc, but a proper ufunc with all its bells and whistles. While the behavior is in many aspects very similar to np.vectorize, it has some distinct advantages, that hopefully the following code should highlight:
In [2]: def f(a, b):
...: return a + b
...:
In [3]: f_vec = np.vectorize(f)
In [4]: f_ufunc = np.frompyfunc(f, 2, 1) # 2 inputs, 1 output
In [5]: a = np.random.rand(1000)
In [6]: b = np.random.rand(2000)
In [7]: %timeit np.add.outer(a, b) # a baseline for comparison
100 loops, best of 3: 9.89 ms per loop
In [8]: %timeit f_vec(a[:, None], b) # 50x slower than np.add
1 loops, best of 3: 488 ms per loop
In [9]: %timeit f_ufunc(a[:, None], b) # ~20% faster than np.vectorize...
1 loops, best of 3: 425 ms per loop
In [10]: %timeit f_ufunc.outer(a, b) # ...and you get to use ufunc methods
1 loops, best of 3: 427 ms per loop
So while it is still clearly inferior to a properly vectorized implementation, it is a little faster (the looping is in C, but you still have the Python function call overhead).
cdist is fast because it is written in highly-optimized C code (as you already pointed out), and it only supports a small predefined set of metrics.
Since you want to apply the operation generically, to any given foo function, you have no choice but to call that function na-times-nb times. That part is not likely to be further optimizable.
What's left to optimize are the loops and the indexing. Some suggestions to try out:
Use xrange instead of range (if in python2.x. in python3, range is already a generator-like)
Use enumerate, instead of range + explicitly indexing
Use a python speed "magic", such as cython or numba, to speed up the looping process.
If you can make further assumptions about foo, it might be possible to speed it up further.
Like #shx2 said, it all depends on what is foo. If you can express it in terms of numpy ufuncs, then use outer method:
In [11]: N = 400
In [12]: B = np.empty((N, N))
In [13]: x = np.random.random(N)
In [14]: y = np.random.random(N)
In [15]: %%timeit
for i in range(N):
for j in range(N):
B[i, j] = x[i] - y[j]
....:
10 loops, best of 3: 87.2 ms per loop
In [16]: %timeit A = np.subtract.outer(x, y) # <--- np.subtract is a ufunc
1000 loops, best of 3: 294 µs per loop
Otherwise you can push the looping down to cython level. Continuing a trivial example above:
In [45]: %%cython
cimport cython
#cython.boundscheck(False)
#cython.wraparound(False)
def foo(double[::1] x, double[::1] y, double[:, ::1] out):
cdef int i, j
for i in xrange(x.shape[0]):
for j in xrange(y.shape[0]):
out[i, j] = x[i] - y[j]
....:
In [46]: foo(x, y, B)
In [47]: np.allclose(B, np.subtract.outer(x, y))
Out[47]: True
In [48]: %timeit foo(x, y, B)
10000 loops, best of 3: 149 µs per loop
The cython example is deliberately made overly simplistic: in reality you might want to add some shape/stride checks, allocate the memory within your function etc.

Categories