I'm working with some rather large sparse matrices (from 5000x5000 to 20000x20000) and need an efficient, flexible way to concatenate them so I can build a stochastic matrix from separate parts.
Right now I'm concatenating four matrices as follows, but it's horribly inefficient. Is there a better way to do this that doesn't involve converting to a dense matrix?
# Assemble the block matrix [[m1, bridge], [bridge.T, m2]] by slice assignment
# (rmat is a lil_matrix of the combined shape)
rmat[0:m1.shape[0], 0:m1.shape[1]] = m1
rmat[m1.shape[0]:rmat.shape[0], m1.shape[1]:rmat.shape[1]] = m2
rmat[0:m1.shape[0], m1.shape[1]:rmat.shape[1]] = bridge
rmat[m1.shape[0]:rmat.shape[0], 0:m1.shape[1]] = bridge.transpose()
The scipy.sparse module now has hstack and vstack for concatenating matrices horizontally and vertically, respectively.
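For the question's 2x2 block layout, a minimal sketch (the small random blocks here are hypothetical stand-ins for m1, m2, and bridge):

from scipy import sparse

# Hypothetical small stand-ins for the real blocks
m1 = sparse.random(3, 3, density=0.5, format='csr')
m2 = sparse.random(2, 2, density=0.5, format='csr')
bridge = sparse.random(3, 2, density=0.5, format='csr')

# Build [[m1, bridge], [bridge.T, m2]] without ever densifying
top = sparse.hstack([m1, bridge])
bottom = sparse.hstack([bridge.transpose(), m2])
rmat = sparse.vstack([top, bottom]).tocsr()
print(rmat.shape)  # (5, 5)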
Amos's answer is no longer necessary. SciPy now does something similar to this internally if the input matrices are in csr or csc format and the desired output format is either None or the same format as the inputs. It's efficient to vertically stack matrices in csr format, or to horizontally stack matrices in csc format, using scipy.sparse.vstack or scipy.sparse.hstack, respectively.
Using hstack, vstack, or concatenate is dramatically slower than concatenating the inner data objects themselves. The reason is that hstack/vstack converts the sparse matrices to coo format, which can be very slow when a matrix is very large and not already in coo format. Here is the code for concatenating csc matrices; a similar method can be used for csr matrices:
import numpy as np
from scipy.sparse import csc_matrix

def concatenate_csc_matrices_by_columns(matrix1, matrix2):
    # Stack the nonzero values and their row indices end to end
    new_data = np.concatenate((matrix1.data, matrix2.data))
    new_indices = np.concatenate((matrix1.indices, matrix2.indices))
    # Shift matrix2's column pointers by matrix1's nnz and drop its leading 0
    new_ind_ptr = matrix2.indptr + len(matrix1.data)
    new_ind_ptr = new_ind_ptr[1:]
    new_ind_ptr = np.concatenate((matrix1.indptr, new_ind_ptr))
    # Pass the shape explicitly so trailing all-zero rows are not lost
    return csc_matrix((new_data, new_indices, new_ind_ptr),
                      shape=(matrix1.shape[0], matrix1.shape[1] + matrix2.shape[1]))
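A quick sanity check of the function above against sparse.hstack (the small inputs are hypothetical):

from scipy import sparse

m1 = sparse.random(4, 3, density=0.5, format='csc')
m2 = sparse.random(4, 2, density=0.5, format='csc')
stacked = concatenate_csc_matrices_by_columns(m1, m2)
print((stacked != sparse.hstack([m1, m2], format='csc')).nnz)  # 0, i.e. identical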
Okay, I found the answer. Using scipy.sparse.coo_matrix is much, much faster than using lil_matrix. I converted the matrices to coo (painless and fast) and then just concatenated the data, rows, and columns after adding the right offsets.
import numpy as np
from scipy.sparse import coo_matrix

# Offset the row/column indices so each block lands in its quadrant
data = np.concatenate((m1S.data, bridgeS.data, bridgeTS.data, m2S.data))
rows = np.concatenate((m1S.row, bridgeS.row, bridgeTS.row + m1S.shape[0], m2S.row + m1S.shape[0]))
cols = np.concatenate((m1S.col, bridgeS.col + m1S.shape[1], bridgeTS.col, m2S.col + m1S.shape[1]))
rmat = coo_matrix((data, (rows, cols)), shape=(m1S.shape[0] + m2S.shape[0], m1S.shape[1] + m2S.shape[1]))
I have created a diagonal numpy array:
import numpy

a = numpy.float32(numpy.random.rand(10))
a = numpy.diag(a)  # 10x10 dense matrix with the random values on the diagonal
However, I face a MemoryError since my matrix is extremely large. Is there any way to save memory?
The best way to handle this case is to create a sparse matrix using scipy.sparse.diags as follows:
import numpy
from scipy import sparse

a = numpy.float32(numpy.random.rand(10))
a = sparse.diags(a)  # sparse matrix with the values on the main diagonal
If your dense diagonal matrix has shape n×n, sparse.diags stores only the n diagonal entries, so the result is roughly n times smaller in memory. Almost all matrix operations are supported for sparse matrices.
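For a sense of scale, a small sketch (the value of n here is hypothetical):

import numpy
from scipy import sparse

n = 20000
d = numpy.float32(numpy.random.rand(n))
D = sparse.diags(d)     # stores ~n float32 values (~80 KB)
# numpy.diag(d) would store n*n float32 values (~1.6 GB)
print(D.shape, D.nnz)   # (20000, 20000) 20000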
I'm using the sparse library (pydata/sparse) to construct, store, and read a large sparse matrix. I'd like to wrap it in a Dask array to take advantage of Dask's blocked algorithms.
Here's a simplified version of what I'm trying to do:
import os
import sparse
import dask.array as da

def get_matrix(file_path='./myfile.npz'):
    if os.path.isfile(file_path):
        # Load file with sparse matrix
        X_sparse = sparse.load_npz(file_path)
    else:
        # All matrix elements are initially equal to 0
        coords, data = [], []
        X_sparse = sparse.COO(coords, data, shape=(88506, 1440000))
        # Create file for later retrieval
        sparse.save_npz(file_path, X_sparse)
    # Create Dask array from matrix to allow usage of blocked algorithms
    X = da.from_array(X_sparse, chunks='auto').map_blocks(sparse.COO)
    return X
Unfortunately, the code above throws the following error when calling compute() on X: "Cannot convert a sparse array to dense automatically. To manually densify, use the todense method." But I cannot convert the sparse matrix to a dense one in memory, as that would fail with a memory error.
Any ideas on how to accomplish this?
You can have a look at the following issue:
https://github.com/dask/dask/issues/4523
Basically, sparse intentionally prevents automatic conversion into a dense matrix.
However, by setting the environment variable SPARSE_AUTO_DENSIFY=1 you can override this behavior. Nevertheless, that only silences the error and does not accomplish your main goal.
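If you do want the override, note that (as far as I can tell) pydata/sparse reads the variable at import time, so it has to be set before the package is imported; a minimal sketch:

import os
os.environ['SPARSE_AUTO_DENSIFY'] = '1'  # must be set before `import sparse`

import sparse  # densification is now allowed (but may exhaust memory)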
What you would need to do is split your data into multiple *.npz sparse matrices, load these with sparse in a delayed manner (see dask.delayed), and concatenate them into one large sparse Dask array, as sketched below.
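A minimal sketch of that approach (the file names, chunk shape, and dtype are hypothetical):

import dask
import dask.array as da
import numpy as np
import sparse

# Hypothetical chunk files, each holding one (10000, 1440000) block
paths = ['chunk0.npz', 'chunk1.npz', 'chunk2.npz']
chunk_shape = (10000, 1440000)

load = dask.delayed(sparse.load_npz)
blocks = [
    da.from_delayed(load(p), shape=chunk_shape, dtype=np.float64,
                    meta=sparse.COO.from_numpy(np.empty((0, 0))))
    for p in paths
]
X = da.concatenate(blocks, axis=0)  # one tall sparse Dask array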
I will have to implement something like this in the near future. IMHO this should be supported by Dask more natively...
dask.array.from_array now supports COO and GCXS sparse arrays natively.
Using dask version '2022.01.0':
In [18]: # All matrix elements are initially equal to 0
...: coords, data = [], []
...: X_sparse = sparse.COO(coords, data, shape=(88506, 1440000))
...:
...: # Create Dask array from matrix to allow usage of blocked algorithms
...: X = dask.array.from_array(X_sparse, chunks="auto").map_blocks(sparse.COO)
In [19]: X
Out[19]: dask.array<COO, shape=(88506, 1440000), dtype=float64, chunksize=(4023, 4000), chunktype=sparse.COO>
See the dask docs on Sparse Arrays for more information.
Support for sparse arrays was added way back in 2017; stability and API support have been steadily improving ever since.
I have this code:
import numpy as np
from scipy.sparse import csr_matrix
q = csr_matrix([[1.], [0.]])
ones = np.ones((2, 1))
How can I add the ones column to matrix q so the result has shape (2, 2)?
(Matrix q is sparse, and I don't want to change its type from csr.)
The code for sparse.hstack is:
return bmat([blocks], format=format, dtype=dtype)
In bmat, blocks then becomes a 1xN array. If the blocks are all csc, it does a fast version of the stack:
A = _compressed_sparse_stack(blocks[0,:], 1)
Conversely, sparse.vstack with csr matrices does:
A = _compressed_sparse_stack(blocks[:,0], 0)
In effect, given how data is stored in a csr matrix, it is relatively easy to add rows (or columns for csc); I can elaborate if that needs explanation.
Otherwise bmat does:
# convert everything to COO format
# calculate total nnz
data = np.empty(nnz, dtype=dtype)
for B in blocks:
    data[nnz:nnz + B.nnz] = B.data
return coo_matrix((data, (row, col)), shape=shape).asformat(format)
In other words, it gets the data, row, col values for each block, concatenates them, makes a new coo matrix, and finally converts it to the desired format.
sparse readily converts between formats. Even the display of a matrix can involve a conversion: to coo for the (i, j) value listing, to csr for the dense/array form. sparse.nonzero converts to coo, and most math converts to csr. A csr matrix is transposed by converting it to csc (without changing the attribute arrays). Much of the conversion is done in compiled code, so you don't see delays.
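A quick illustration of those cheap conversions (small hypothetical matrix):

from scipy import sparse

A = sparse.random(3, 4, density=0.5, format='csr')
print(A.T.format)   # 'csc': transpose just relabels the same attribute arrays
print(A.tocoo().format, A.T.tocsr().format)  # 'coo' 'csr'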
Adding columns directly to csr format is a lot of work; all 3 attribute arrays have to be modified row by row. Again, I could go into detail if needed.
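Putting this together for the question's example, a minimal sketch that lets hstack do the coo round trip and asks for csr output:

import numpy as np
from scipy.sparse import csr_matrix, hstack

q = csr_matrix([[1.], [0.]])
ones = np.ones((2, 1))

result = hstack([q, csr_matrix(ones)], format='csr')
print(result.toarray())  # [[1. 1.], [0. 1.]]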
Suppose I have two dense matrices U (10000x50) and V (50x10000), and one sparse matrix A (10000x10000). Each element in A is either 1 or 0. I want to compute A*(UV), where '*' is element-wise multiplication. SciPy/NumPy will calculate the dense matrix UV first, and since UV is dense and large (10000x10000), this is very slow.
Because I only need a few elements of UV indicated by A, it should save a lot of time if only necessary elements are calculated instead of calculating all elements then filtering using A. Is there a way to instruct scipy to do this?
BTW, I used Matlab to solve this problem, and Matlab seemed smart enough to detect what I was trying to do and worked efficiently.
Update:
I found that Matlab calculates UV fully, just as SciPy does. My SciPy installation is simply too slow...
Here's a test script and possible speedup. The basic idea is to use the nonzero coordinates of A to select rows and columns of U and V, and then use einsum to perform a subset of the possible dot products.
import numpy as np
from scipy import sparse

#M,N,d = 10,5,.1
#M,N,d = 1000,50,.1
M,N,d = 5000,50,.01  # about the limit for my memory

A = sparse.rand(M,M,d)
A.data[:] = 1  # a sparse 0,1 array
U = (np.arange(M*N)/(M*N)).reshape(M,N)
V = (np.arange(M*N)/(M*N)).reshape(N,M)

A1 = A.multiply(U.dot(V))  # the direct solution
A2 = np.einsum('ij,ik,kj->ij', A.A, U, V)
print(np.allclose(A1.A, A2))

def foo(A,U,V):
    # use A's nonzero coordinates to select rows of U and columns of V
    A3 = A.copy()
    U1 = U[A.row,:]
    V1 = V[:,A.col]
    A3.data[:] = np.einsum('ij,ji->i', U1, V1)
    return A3

A3 = foo(A,U,V)
print(np.allclose(A1.A, A3.A))
The 3 solutions match. For large arrays, foo is about 2x faster than the direct solution. For small sizes the pure einsum is competitive, but it bogs down for large arrays.
Using dot in foo would have computed far too many products: ij,jk->ik, as opposed to only the needed ij,ji->i.
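To make that product count concrete, a tiny sketch with hypothetical shapes:

import numpy as np

U1 = np.random.rand(6, 3)   # 6 selected rows, N = 3
V1 = np.random.rand(3, 6)   # 6 selected columns
full = U1.dot(V1)                      # computes all 6*6 pairings (ij,jk->ik)
diag = np.einsum('ij,ji->i', U1, V1)   # computes only the 6 needed ones
print(np.allclose(np.diag(full), diag))  # True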
I'm looking for a simple sparse vector implementation that can be mapped into memory, similar to numpy.memmap.
Unfortunately, the numpy implementation deals only with the full vector. Example usage:
vec = SparseVector('/tmp/file.dat')  # SparseVector is the class I'm looking for
vec[10] = 10
vec[50] = 21
for key in vec:
    print(vec[key])  # 10, 21
I found the scipy classes representing sparse matrices; however, the 2 dimensions are clumsy to use, as I'd need to make a matrix with only one row and then index it as vec[0, i].
Any suggestions?
Someone else was just asking about 1d sparse vectors, only they wanted to take advantage of the scipy.sparse method of handling duplicate indices.
is there something like coo_matrix but for sparse vectors?
As shown there, a coo_matrix actually consists of 3 numpy arrays: data, row, and col. Other formats rearrange the values in other ways; lil, for example, has 2 nested lists, one for the data and another for the coordinates, while dok is a regular dictionary with (i, j) tuples as keys.
In theory, then, a sparse vector requires 2 arrays. Or, as your example shows, it could be a simple dictionary.
So you could implement a mmap sparse vector using two mmap arrays; a sketch follows below. As far as I know there isn't a mmap version of the scipy sparse matrices, though it's not something I've looked for.
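A minimal sketch of that idea, assuming a fixed capacity and append-only writes (the class name and file layout are hypothetical):

import numpy as np

class MmapSparseVector:
    """Toy memory-mapped sparse vector: two parallel memmap arrays,
    one for indices and one for values. Fixed capacity, append-only."""
    def __init__(self, path, capacity=1024):
        self.indices = np.memmap(path + '.idx', dtype=np.int64, mode='w+', shape=(capacity,))
        self.values = np.memmap(path + '.val', dtype=np.float64, mode='w+', shape=(capacity,))
        self.n = 0  # number of stored entries

    def __setitem__(self, key, value):
        self.indices[self.n] = key
        self.values[self.n] = value
        self.n += 1

    def __getitem__(self, key):
        hits = np.nonzero(self.indices[:self.n] == key)[0]
        return self.values[hits[-1]] if hits.size else 0.0  # latest write wins

    def __iter__(self):
        return iter(self.indices[:self.n].tolist())

vec = MmapSparseVector('/tmp/file.dat')
vec[10] = 10
vec[50] = 21
for key in vec:
    print(vec[key])  # 10.0, 21.0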
But what functionality do you want? What dimension? So large that a dense version would not fit in regular memory? Are you doing math with it? Or just data lookup?