mmap sparse vector in python

I'm looking for a simple sparse vector implementation that can be mapped into memory, similarly to numpy.memmap.
Unfortunately, the numpy implementation deals only with full (dense) vectors. Example usage:
vec = SparseVector('/tmp/file.dat') # SparseVector is the class I'm looking for
vec[10] = 10
vec[50] = 21
for key in vec:
    print(vec[key])  # 10, 21
I found a scipy class representing a sparse matrix; however, two dimensions are clumsy to use, as I'd need to make a matrix with only one row and then use vec[0,i].
Any suggestions?

Someone else was just asking about 1d sparse vectors, only they wanted to take advantage of the scipy.sparse method of handling duplicate indices:
Is there something like coo_matrix but for sparse vectors?
As shown there, a coo_matrix actually consists of 3 numpy arrays, data, row, col. Other formats rearrange the values in other ways, lil for example has 2 nested lists, one for the data, another for the coordinates. dok is a regular dictionary, with (i,j) tuples as keys.
In theory then a sparse vector will require 2 arrays. Or as your example shows it could be a simple dictionary.
So you could implement a mmap sparse vector by using two mmap arrays. As far as I know there isn't a mmap version of the scipy sparse matrices, though it's not something I've looked for.
But what functionality do you want? What dimension? So large that a dense version would not fit in regular memory? Are you doing math with it? Or just data lookup?
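As a rough illustration of the two-array idea, here is a minimal sketch that keeps the indices and the values in two numpy.memmap files. The class name MemmapSparseVector, the .idx/.val file suffixes, the fixed capacity and the -1 marker for unused slots are all assumptions made for this example, not an existing library API. Lookups are a plain linear scan, so this only suits vectors with a modest number of stored entries.

```python
import os
import tempfile
import numpy as np

class MemmapSparseVector:
    """Hypothetical 1-D sparse vector backed by two np.memmap arrays."""

    def __init__(self, path, capacity=1024):
        idx_path, val_path = path + ".idx", path + ".val"
        new = not (os.path.exists(idx_path) and os.path.exists(val_path))
        mode = "w+" if new else "r+"
        self._idx = np.memmap(idx_path, dtype=np.int64, mode=mode, shape=(capacity,))
        self._val = np.memmap(val_path, dtype=np.float64, mode=mode, shape=(capacity,))
        if new:
            self._idx[:] = -1  # -1 marks an unused slot (assumption of this sketch)
        # Number of slots in use; entries are appended, so they form a prefix.
        self._n = int(np.count_nonzero(self._idx != -1))

    def _find(self, key):
        # Linear scan over the used slots; returns the slot position or -1.
        hits = np.flatnonzero(self._idx[:self._n] == key)
        return int(hits[0]) if hits.size else -1

    def __setitem__(self, key, value):
        pos = self._find(key)
        if pos < 0:
            if self._n == len(self._idx):
                raise MemoryError("capacity exhausted")
            pos = self._n
            self._idx[pos] = key
            self._n += 1
        self._val[pos] = value

    def __getitem__(self, key):
        pos = self._find(key)
        return float(self._val[pos]) if pos >= 0 else 0.0

    def __iter__(self):
        # Iterate over the stored indices in insertion order.
        return iter(self._idx[:self._n].tolist())

    def flush(self):
        self._idx.flush()
        self._val.flush()

# Usage, mirroring the example in the question.
path = os.path.join(tempfile.mkdtemp(), "vec")
vec = MemmapSparseVector(path)
vec[10] = 10
vec[50] = 21
print([vec[key] for key in vec])
vec.flush()
```

Keeping the index array sorted and using np.searchsorted would make lookups faster, at the cost of shifting entries on insert.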

Related

Python - matrix multiplication with sparse result

Suppose I have two dense matrices U (10000x50) and V(50x10000), and one sparse matrix A(10000x10000). Each element in A is either 1 or 0. I hope to find A*(UV), noting that '*' is element-wise multiplication. To solve the problem, Scipy/numpy will calculate a dense matrix UV first. But UV is dense and large (10000x10000) so it's very slow.
Because I only need a few elements of UV indicated by A, it should save a lot of time if only necessary elements are calculated instead of calculating all elements then filtering using A. Is there a way to instruct scipy to do this?
BTW, I used Matlab to solve this problem and Matlab is smart enough to find what I'm trying to do and works efficiently.
Update:
I found Matlab calculated UV fully as scipy does. My scipy installation is simply too slow...
Here's a test script and possible speedup. The basic idea is to use the nonzero coordinates of A to select rows and columns of U and V, and then use einsum to perform a subset of the possible dot products.
import numpy as np
from scipy import sparse
#M,N,d = 10,5,.1
#M,N,d = 1000,50,.1
M,N,d = 5000,50,.01 # about the limit for my memory
A=sparse.rand(M,M,d) # coo format by default, so A.row and A.col exist
A.data[:] = 1 # a sparse 0,1 array
U=(np.arange(M*N)/(M*N)).reshape(M,N)
V=(np.arange(M*N)/(M*N)).reshape(N,M)
A1=A.multiply(U.dot(V)).toarray() # the direct solution (sparse result in recent scipy, made dense for comparison)
A2=np.einsum('ij,ik,kj->ij',A.toarray(),U,V)
print(np.allclose(A1,A2))
def foo(A,U,V):
    # use A to select elements of U and V
    A3=A.copy()
    U1=U[A.row,:]
    V1=V[:,A.col]
    # one dot product per nonzero element of A
    A3.data[:]=np.einsum('ij,ji->i',U1,V1)
    return A3
A3 = foo(A,U,V)
print(np.allclose(A1,A3.toarray()))
The 3 solutions match. For large arrays, foo is about 2x faster than the direct solution. For small size, the pure einsum is competitive, but bogs down for large arrays.
The use of dot in foo would have computed too many products, ij,jk->ik as opposed to ij,ji->i.
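To make the difference between the two index patterns concrete, here is a tiny sketch with made-up arrays (U1 and V1 here are illustrative values, not the ones from the script above). The 'ij,ji->i' pattern pairs row i of U1 with column i of V1 only, which is the diagonal of the full product.

```python
import numpy as np

U1 = np.arange(6.0).reshape(3, 2)   # 3 selected rows of U
V1 = np.arange(6.0).reshape(2, 3)   # 3 selected columns of V

# One dot product per selected (row, column) pair: 3 products in total.
pairwise = np.einsum('ij,ji->i', U1, V1)

# dot computes every row against every column: 3 x 3 = 9 products,
# of which only the diagonal is wanted.
full = U1.dot(V1)

print(pairwise)
print(np.diag(full))
```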

Sparse-Dense multiplication in Python

I am using Python 3.23 and I want to multiply a sparse VECTOR with a dense MATRIX. Unfolding the sparse vector into a dense one first and then multiplying is of course silly: it saves memory only until the actual unfolding, and the multiplication is more expensive with all the zeros in there...
Also, does anyone know of a good way for SciPy to keep one-dimensional matrices in sparse mode? The only one (admittedly) I have used is the classical notation of three vectors (x,y,value), so I have had to use np.ones(len(...)) to get it to work.
Well.. comments welcome!
Store the vector using the Scipy sparse matrix classes:
import numpy as np
from scipy.sparse import csr_matrix
x = csr_matrix(np.random.rand(1000) > 0.99).T
print(x.shape)  # (1000, 1)
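For the multiplication itself, a minimal sketch might look like the following. The sizes, the index/value arrays and the variable names are made up for illustration. The vector is stored as a 1 x 1000 sparse row, and its dot method multiplies it with the dense matrix without unfolding it first.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
dense = rng.random((1000, 20))           # dense matrix, shape (1000, 20)

# Build a 1 x 1000 sparse row vector from (column index, value) pairs.
cols = np.array([3, 250, 999])
vals = np.array([1.0, 2.0, 3.0])
rows = np.zeros_like(cols)               # every entry sits in row 0
x = sparse.csr_matrix((vals, (rows, cols)), shape=(1, 1000))

# Sparse (1 x 1000) times dense (1000 x 20); only the stored entries are used.
y = np.asarray(x.dot(dense)).ravel()     # flatten to shape (20,)

# The same product written out over the nonzero entries only.
expected = vals @ dense[cols]
print(np.allclose(y, expected))
```

Using zeros for the row coordinates plays the same role as the np.ones(len(...)) workaround mentioned in the question.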

Efficient numpy / lapack routine for product of inverse and sparse matrix?

I have a matrix B that is square and dense, and a matrix A that is rectangular and sparse.
Is there a way to efficiently compute the product B^-1 * A?
So far, I use (in numpy)
tmp = B.inv()
return tmp * A
which, I believe, makes use of A's sparsity. I was thinking about using the sparse method
scipy.sparse.linalg.spsolve, but this requires B, and not A, to be sparse.
Is there another way to speed things up?
Since the matrix to be inverted is dense, spsolve is not the tool you want. In addition, it is bad numerical practice to calculate the inverse of a matrix and multiply it by another - you are much better off using LU decomposition, which is supported by scipy.
Another point is that unless you are using the matrix class (I think that the ndarray class is better, though this is something of a question of taste), you need to use dot instead of the multiplication operator. And if you want to efficiently multiply a sparse matrix by a dense matrix, you need to use the dot method of the sparse matrix. Unfortunately this only works if the first matrix is sparse, so you need to use the trick which Anycorn suggested of taking the transpose to swap the order of operations.
Here is a lazy implementation which doesn't use the LU decomposition, but which should otherwise be efficient:
import scipy.linalg
B_inv = scipy.linalg.inv(B)
C = (A.transpose().dot(B_inv.transpose())).transpose()
Doing it properly with the LU decomposition involves finding a way to efficiently multiply a triangular matrix by a sparse matrix, which currently eludes me.
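One way to use the LU decomposition without forming the inverse is sketched below with scipy.linalg.lu_factor and lu_solve. Note the assumption: A is converted to a dense array before solving, so this does not exploit A's sparsity, which may still be acceptable because the result B^-1 A is generally dense anyway. The matrix sizes and random test data are made up for the example.

```python
import numpy as np
from scipy import sparse
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n, m = 200, 50
# Dense square matrix; the added diagonal keeps it well conditioned.
B = rng.random((n, n)) + n * np.eye(n)
# Rectangular sparse matrix.
A = sparse.random(n, m, density=0.05, format="csc", random_state=0)

lu, piv = lu_factor(B)                  # factor B once
C = lu_solve((lu, piv), A.toarray())    # solves B C = A, i.e. C = B^-1 A

print(np.allclose(B @ C, A.toarray()))
```

If A is too large to densify at once, the same factorization can be reused on blocks of columns of A.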

Is there an efficient way of concatenating scipy.sparse matrices?

I'm working with some rather large sparse matrices (from 5000x5000 to 20000x20000) and need to find an efficient way to concatenate matrices in a flexible way in order to construct a stochastic matrix from separate parts.
Right now I'm using the following way to concatenate four matrices, but it's horribly inefficient. Is there any better way to do this that doesn't involve converting to a dense matrix?
rmat[0:m1.shape[0],0:m1.shape[1]] = m1
rmat[m1.shape[0]:rmat.shape[0],m1.shape[1]:rmat.shape[1]] = m2
rmat[0:m1.shape[0],m1.shape[1]:rmat.shape[1]] = bridge
rmat[m1.shape[0]:rmat.shape[0],0:m1.shape[1]] = bridge.transpose()
The sparse library now has hstack and vstack for respectively concatenating matrices horizontally and vertically.
Amos's answer is no longer necessary. Scipy now does something similar to this internally if the input matrices are in csr or csc format and the desired output format is set to None or the same format as the input matrices. It's efficient to vertically stack matrices in csr format, or to horizontally stack matrices in csc format, using scipy.sparse.vstack or scipy.sparse.hstack, respectively.
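Applied to the four-block layout from the question, a sketch could look like this. The small random blocks stand in for m1, m2 and bridge and are assumptions for the example. scipy.sparse.bmat builds the same block matrix in one call.

```python
import numpy as np
from scipy import sparse

# Small stand-ins for the blocks in the question.
m1 = sparse.random(4, 4, density=0.5, format="csr", random_state=0)
m2 = sparse.random(3, 3, density=0.5, format="csr", random_state=1)
bridge = sparse.random(4, 3, density=0.5, format="csr", random_state=2)

# [[m1,       bridge],
#  [bridge.T, m2    ]]
top = sparse.hstack([m1, bridge])
bottom = sparse.hstack([bridge.T, m2])
rmat = sparse.vstack([top, bottom], format="csr")

# The same layout in a single call.
rmat_b = sparse.bmat([[m1, bridge], [bridge.T, m2]], format="csr")

print(rmat.shape)
```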
Using hstack, vstack, or concatenate is dramatically slower than concatenating the inner data objects themselves. The reason is that hstack/vstack converts the sparse matrix to coo format, which can be very slow when the matrix is very large and not in coo format. Here is the code for concatenating csc matrices; a similar method can be used for csr matrices:
import numpy as np
from scipy.sparse import csc_matrix
def concatenate_csc_matrices_by_columns(matrix1, matrix2):
    # both inputs must be csc matrices with the same number of rows
    new_data = np.concatenate((matrix1.data, matrix2.data))
    new_indices = np.concatenate((matrix1.indices, matrix2.indices))
    # shift matrix2's column pointers past matrix1's entries and drop its leading 0
    new_ind_ptr = matrix2.indptr[1:] + len(matrix1.data)
    new_ind_ptr = np.concatenate((matrix1.indptr, new_ind_ptr))
    # pass the shape explicitly so that trailing empty rows are not lost
    shape = (matrix1.shape[0], matrix1.shape[1] + matrix2.shape[1])
    return csc_matrix((new_data, new_indices, new_ind_ptr), shape=shape)
Okay, I found the answer. Using scipy.sparse.coo_matrix is much much faster than using lil_matrix. I converted the matrices to coo (painless and fast) and then just concatenated the data, rows and columns after adding the right padding.
import numpy as np
import scipy.sparse
# m1S, bridgeS, bridgeTS and m2S are the four blocks, already converted to coo format
data = np.concatenate((m1S.data, bridgeS.data, bridgeTS.data, m2S.data))
rows = np.concatenate((m1S.row, bridgeS.row, bridgeTS.row + m1S.shape[0], m2S.row + m1S.shape[0]))
cols = np.concatenate((m1S.col, bridgeS.col + m1S.shape[1], bridgeTS.col, m2S.col + m1S.shape[1]))
shape = (m1S.shape[0] + m2S.shape[0], m1S.shape[1] + m2S.shape[1])
rmat = scipy.sparse.coo_matrix((data, (rows, cols)), shape=shape)

Addressing ranges in a Scipy sparse matrix

I have a large matrix, currently in numpy, that I would like to port over to a scipy sparse matrix, because saving the text representation of the numpy (2000,2000) matrix takes over 100 MB.
(1) There seems to be a surfeit of sparse matrix types available in scipy (for instance, lil_matrix or dok_matrix). Which one would be optimal for simple incrementing, and efficient to save to a database?
(2) I'd like to be able to address ranges in the matrix like so:
>>> import numpy as np
>>> a = np.zeros((1000,1000))
>>> a[3:5,4:7] += 1
It seems that this is not possible for the sparse matrices?
I can't say which is most efficient to store. It's going to depend on your data.
I can say, however, that the += operator works; you just can't rely on the usual array broadcasting rules:
>>> import numpy as np
>>> from scipy import sparse
>>> m = sparse.lil_matrix((100,100))
>>> m[50:56,50:56] += np.ones((6,6))
>>> m[50,50]  # 1.0
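For the exact range from the question, a hedged alternative is to read the block out as a dense array, increment it, and assign it back. As far as I know, adding a nonzero scalar directly to a sparse matrix is not supported, which is why the scalar form from the question does not carry over unchanged. This sketch assumes the addressed blocks are small enough to densify.

```python
import numpy as np
from scipy import sparse

m = sparse.lil_matrix((1000, 1000))

# Read the 2 x 3 block as a dense array, increment it, and write it back.
block = m[3:5, 4:7].toarray()
m[3:5, 4:7] = block + 1

print(m[3, 4], m.nnz)
```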
