Problem
Consider
P: A (N, n_x) matrix.
Then I want to find the indices of a sparse COO matrix such that
indices = []
for i in range(N):
for j1 in range(n_x):
for j2 in range(n_x):
indices.append([P[i, j1], P[i, j2]])
indices = unique(indices, axis=0)
Faster Solution
The above solution is both inefficient in terms of time and memory. A faster option using Numpy is below
col_idx = np.reshape(np.tile(P, n_x), [N, n_x, n_x])
row_idx = np.transpose(col_idx, [0,2,1])
indices = np.concatenate((row_idx[:,None], col_idx[:, None]), axis=1)
indices = np.unique(indices, axis=0)
Note however that this still requires building 2 N*n_x*n_x arrays which can be much larger than necessary if we only have a small number of unique elements.
Question
How can I build a fast but also memory efficient algorithm for doing the following. Currently the fast solution is not usable as it requires too much memory.
The solution could be Python but also an algorithm that I could code in C would suffice.
I think both in C++ and Python the way to go is using a set.
In the version below I used Numba which gives apox. 30x speedup to the pure Python version.
Python
import numba as nb
import numpy as np
import time
N=500
n_x=600
P=np.random.randint(0,50,N*n_x).reshape(N,n_x)
#nb.jit()
def nb_sparse_coo(P):
indices = set()
for i in range(P.shape[0]):
for j1 in range(P.shape[1]):
for j2 in range(j1,P.shape[1]):
indices.add((P[i, j1], P[i, j2]))
return np.array(list(indices))
indices=nb_sparse_coo(P)
Related
Given two sparse scipy matrices A, B I want to compute the row-wise outer product.
I can do this with numpy in a number of ways. The easiest perhaps being
np.einsum('ij,ik->ijk', A, B).reshape(n, -1)
or
(A[:, :, np.newaxis] * B[:, np.newaxis, :]).reshape(n, -1)
where n is the number of rows in A and B.
In my case, however, going through dense matrices eat up way too much RAM.
The only option I have found is thus to use a python loop:
sp.sparse.vstack((ra.T#rb).reshape(1,-1) for ra, rb in zip(A,B)).tocsr()
While using less RAM, this is very slow.
My question is thus, is there a sparse (RAM efficient) way to take the row-wise outer product of two matrices, which keeps things vectorized?
(A similar question is numpy elementwise outer product with sparse matrices but all answers there go through dense matrices.)
We can directly calculate the csr representation of the result. It's not superfast (~3 seconds on 100,000x768) but may be ok, depending on your use case:
import numpy as np
import itertools
from scipy import sparse
def spouter(A,B):
N,L = A.shape
N,K = B.shape
drows = zip(*(np.split(x.data,x.indptr[1:-1]) for x in (A,B)))
data = [np.outer(a,b).ravel() for a,b in drows]
irows = zip(*(np.split(x.indices,x.indptr[1:-1]) for x in (A,B)))
indices = [np.ravel_multi_index(np.ix_(a,b),(L,K)).ravel() for a,b in irows]
indptr = np.fromiter(itertools.chain((0,),map(len,indices)),int).cumsum()
return sparse.csr_matrix((np.concatenate(data),np.concatenate(indices),indptr),(N,L*K))
A = sparse.random(100,768,0.03).tocsr()
B = sparse.random(100,768,0.03).tocsr()
print(np.all(np.einsum('ij,ik->ijk',A.A,B.A).reshape(100,-1) == spouter(A,B).A))
A = sparse.random(100000,768,0.03).tocsr()
B = sparse.random(100000,768,0.03).tocsr()
from time import time
T = time()
C = spouter(A,B)
print(time()-T)
Sample run:
True
3.1073222160339355
What is the most efficient way to compute a sparse boolean matrix I from one or two arrays a,b, with I[i,j]==True where a[i]==b[j]? The following is fast but memory-inefficient:
I = a[:,None]==b
The following is slow and still memory-inefficient during creation:
I = csr((a[:,None]==b),shape=(len(a),len(b)))
The following gives at least the rows,cols for better csr_matrix initialization, but it still creates the full dense matrix and is equally slow:
z = np.argwhere((a[:,None]==b))
Any ideas?
One way to do it would be to first identify all different elements that a and b have in common using sets. This should work well if there are not very many different possibilities for the values in a and b. One then would only have to loop over the different values (below in variable values) and use np.argwhere to identify the indices in a and b where these values occur. The 2D indices of the sparse matrix can then be constructed using np.repeat and np.tile:
import numpy as np
from scipy import sparse
a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))
## matrix generation after OP
I1 = sparse.csr_matrix((a[:,None]==b),shape=(len(a),len(b)))
##identifying all values that occur both in a and b:
values = set(np.unique(a)) & set(np.unique(b))
##here we collect the indices in a and b where the respective values are the same:
rows, cols = [], []
##looping over the common values, finding their indices in a and b, and
##generating the 2D indices of the sparse matrix with np.repeat and np.tile
for value in values:
x = np.argwhere(a==value).ravel()
y = np.argwhere(b==value).ravel()
rows.append(np.repeat(x, len(x)))
cols.append(np.tile(y, len(y)))
##concatenating the indices for different values and generating a 1D vector
##of True values for final matrix generation
rows = np.hstack(rows)
cols = np.hstack(cols)
data = np.ones(len(rows),dtype=bool)
##generating sparse matrix
I3 = sparse.csr_matrix( (data,(rows,cols)), shape=(len(a),len(b)) )
##checking that the matrix was generated correctly:
print((I1 != I3).nnz==0)
The syntax for generating the csr matrix is taken from the documentation. The test for sparse matrix equality is taken from this post.
Old Answer:
I don't know about performance, but at least you can avoid constructing the full dense matrix by using a simple generator expression. Here some code that uses two 1d arras of random integers to first generate the sparse matrix the way that the OP posted and then uses a generator expression to test all elements for equality:
import numpy as np
from scipy import sparse
a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))
## matrix generation after OP
I1 = sparse.csr_matrix((a[:,None]==b),shape=(len(a),len(b)))
## matrix generation using generator
data, rows, cols = zip(
*((True, i, j) for i,A in enumerate(a) for j,B in enumerate(b) if A==B)
)
I2 = sparse.csr_matrix((data, (rows, cols)), shape=(len(a), len(b)))
##testing that matrices are equal
## from https://stackoverflow.com/a/30685839/2454357
print((I1 != I2).nnz==0) ## --> True
I think there is no way around the double loop and ideally this would be pushed into numpy, but at least with the generator the loops are somewhat optimised ...
You could use numpy.isclose with small tolerance:
np.isclose(a,b)
Or pandas.DataFrame.eq:
a.eq(b)
Note this returns an array of True False.
I've been recently dealing with matrix constructnion in numpy/scipy, and I am trying to find an optimal solution to the following problem.
I have a function, which returns tuples, consisting of matrix row-indexes and the corresponding rows. At this moment, I simply construct a numpy matrix, such as:
mat = np.zeros((n, n))
where n is the matrix dimension. I incrementally loop through function results and add the corresponding rows to this very matrix in the following way:
mat[result_tuple[0],:] = result_tuple[1]
and it works fine and fast. The problem is, this is quite memory-intensive. Is there a way to do this efficiently with scipy.sparse module?
My current attempt works, but is quite slow:
for result_tuple in results:
col = range(0,n,1)
row = np.repeat(result_tuple[0],n)
val = result_tuple[1]
mat = mat + sp.csr_matrix((val, (row,col)), shape(n,n), dtype=float)
where again, result tuple consists of (row index, row data).
Thank you very much.
Suppose I have two arrays A and B with dimensions (n1,m1,m2) and (n2,m1,m2), respectively. I want to compute the matrix C with dimensions (n1,n2) such that C[i,j] = sum((A[i,:,:] - B[j,:,:])^2). Here is what I have so far:
import numpy as np
A = np.array(range(1,13)).reshape(3,2,2)
B = np.array(range(1,9)).reshape(2,2,2)
C = np.zeros(shape=(A.shape[0], B.shape[0]) )
for i in range(A.shape[0]):
for j in range(B.shape[0]):
C[i,j] = np.sum(np.square(A[i,:,:] - B[j,:,:]))
C
What is the most efficient way to do this? In R I would use a vectorized approach, such as outer. Is there a similar method for Python?
Thanks.
You can use scipy's cdist, which is pretty efficient for such calculations after reshaping the input arrays to 2D, like so -
from scipy.spatial.distance import cdist
C = cdist(A.reshape(A.shape[0],-1),B.reshape(B.shape[0],-1),'sqeuclidean')
Now, the above approach must be memory efficient and thus a better one when working with large datasizes. For small input arrays, one can also use np.einsum and leverage NumPy broadcasting, like so -
diffs = A[:,None]-B
C = np.einsum('ijkl,ijkl->ij',diffs,diffs)
This question has two parts (maybe one solution?):
Sample vectors from a sparse matrix: Is there an easy way to sample vectors from a sparse matrix?
When I'm trying to sample lines using random.sample I get an TypeError: sparse matrix length is ambiguous.
from random import sample
import numpy as np
from scipy.sparse import lil_matrix
K = 2
m = [[1,2],[0,4],[5,0],[0,8]]
sample(m,K) #works OK
mm = np.array(m)
sample(m,K) #works OK
sm = lil_matrix(m)
sample(sm,K) #throws exception TypeError: sparse matrix length is ambiguous.
My current solution is to sample from the number of rows in the matrix, then use getrow(),, something like:
indxSampls = sample(range(sm.shape[0]), k)
sampledRows = []
for i in indxSampls:
sampledRows+=[sm.getrow(i)]
Any other efficient/elegant ideas? the dense matrix size is 1000x30000 and could be larger.
Constructing a sparse matrix from a list of sparse vectors: Now imagine I have the list of sampled vectors sampledRows, how can I convert it to a sparse matrix without densify it, convert it to list of lists and then convet it to lil_matrix?
Try
sm[np.random.sample(sm.shape[0], K, replace=False), :]
This gets you out an LIL-format matrix with just K of the rows (in the order determined by the random.sample). I'm not sure it's super-fast, but it can't really be worse than manually accessing row by row like you're currently doing, and probably preallocates the results.
The accepted answer to this question is outdated and no longer works. With newer versions of numpy, you should use np.random.choice in place of np.random.sample, e.g.:
sm[np.random.choice(sm.shape[0], K, replace=False), :]
as opposed to:
sm[np.random.sample(sm.shape[0], K, replace=False), :]