pytables/HDF5 sparse matrix multiplication - python

I work with sparse complex-valued matrices A of up to ~100000x100000 elements, with a sparsity of at most 3%. I wish to do things like expm(-At)v, A\v, etc., where v is an arbitrary vector. I used MATLAB, but I believe this is too much for MATLAB's memory, so I decided to switch to Python with HDF5/PyTables.
However, I am unable to find modules in PyTables etc. that do a sparse matrix multiplication in HDF5 style: pull one chunk of the matrix from disk, multiply it by the matching chunk of the vector, move on to the next chunk, and never hold the full, humongous matrix in RAM. Writing that myself as a plain Python for loop over chunks seems too time-consuming.
My eventual goal is to integrate this disk-based multiplication routine into modules like scipy.sparse.linalg.expm_multiply to calculate expm(-At)*v, etc. for such huge sparse matrices.
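To make the goal concrete, here is a rough sketch of the chunked product I have in mind, assuming the CSR components of A are stored as three HDF5 datasets named data, indices and indptr (those names, the block size and the use of h5py are just for illustration):

import h5py
import numpy as np
from scipy.sparse import csr_matrix

def chunked_csr_matvec(h5path, v, block_rows=10000):
    # Computes y = A @ v with A stored on disk as CSR components,
    # reading one block of rows at a time.
    with h5py.File(h5path, "r") as f:
        indptr = f["indptr"][:]          # (n_rows + 1,) -- small, fits in RAM
        n_rows = len(indptr) - 1
        y = np.zeros(n_rows, dtype=np.result_type(f["data"].dtype, v.dtype))
        for r0 in range(0, n_rows, block_rows):
            r1 = min(r0 + block_rows, n_rows)
            lo, hi = indptr[r0], indptr[r1]
            block = csr_matrix(
                (f["data"][lo:hi], f["indices"][lo:hi], indptr[r0:r1 + 1] - lo),
                shape=(r1 - r0, len(v)))
            y[r0:r1] = block @ v         # the sparse kernel does the work
    return y

But I would much rather use an existing, optimized routine than this hand-rolled loop.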

Related

How to convert a large (10^6 * 10^6) Numpy sparse matrix to a Scipy sparse matrix?

I have a very large sparse Numpy matrix (of type numpy.ndarray). The matrix is so large that it probably has to be stored in virtual memory. How can I efficiently convert it to a sparse Scipy matrix (from scipy.sparse), to be used for arithmetic operations?
The following is a direct conversion with dok_matrix, which presumably fails due to a memory issue. Changing dok_matrix to csr_matrix causes the same memory issue.
In [1]: N=int(1e6)
In [2]: from scipy.sparse import dok_matrix
In [3]: import numpy as np
In [4]: mat=np.zeros((N,N))
In [5]: dok_matrix(mat)
zsh: killed ipython
My current method is to use a nested loop, which is slow even if I do nothing.
In [9]: for i in range(N):
   ...:     for j in range(N):
   ...:         pass
Any efficient solution to convert a large (10^6 * 10^6) Numpy sparse matrix to a Scipy sparse matrix?
When you do mat = np.zeros((N,N)), Numpy allocates a big matrix full of zeros. To do that, it requests a big zeroized buffer from the operating system (OS). Most OSes do not actually perform the allocation in physical memory but in virtual memory: the pages are mapped on first touch, meaning any read or write causes the touched virtual pages to be mapped into physical memory. Please see this post for more information about this, and consider reading the famous article What Every Programmer Should Know About Memory for background on virtual memory.
The thing is, dok_matrix(mat) needs to read the whole matrix in order to create the sparse matrix, so it triggers the mapping of all pages of the matrix, resulting in an out-of-memory condition. When there is no space left, Linux's OOM killer kills the programs using too much memory, hence the killed ipython message. The same problem happens with any kind of sparse matrix.
The main problem is that you cannot read the whole created matrix. In fact, creating such a matrix is nearly pointless unless you know that only some tiny parts of it will ever be read or written (never the full matrix).
The solution is to create a sparse matrix directly and fill it the way you would a Numpy dense matrix. It is significantly slower, but this is the price to pay for using sparse matrices. DOK matrices generally take a lot of memory unless only a few items are filled. That being said, DOK matrices are among the fastest for random accesses (because they are implemented with a hash table internally). CSR is good for matrices that do not change once created (i.e. modifying a CSR matrix is very slow) and that have only a few non-zero items per row. Note that CSR matrices can be created quite quickly from a 1D data array and a (row_ind, col_ind) tuple of index arrays. See the documentation for more information.
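For example, a minimal sketch of that constructor (the entries here are made up):

import numpy as np
from scipy.sparse import csr_matrix

N = 1_000_000
# Hypothetical non-zero entries as (row, col, value) triplets.
row = np.array([0, 3, 7, 999_999])
col = np.array([5, 3, 0, 123])
val = np.array([1.0, 2.5, -4.0, 3.3])

# Builds the 10^6 x 10^6 matrix directly in sparse form;
# no dense N x N buffer is ever allocated.
A = csr_matrix((val, (row, col)), shape=(N, N))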
The sparsity of the matrix, that is, the ratio NumberOfNonZeroValues / NumberOfValues, needs to be (very) small for a sparse matrix to be useful. Sparse matrix operations tend to be slow because their internal representation works on a kind of compressed matrix.

Python - Store individual numpy matrices to minimize memory on disk and loading time. Binary files?

I wrote some code to generate a large dataset of complex numpy matrices for ML applications, which I would like to somehow store on disk.
The most suitable idea seems to be saving the matrices into separate binary files. However, commands such as bytearray() seem to flatten the matrices into 1D arrays, thus losing the information about the matrix shape.
I guess I might need to fill each line independently, maybe using an additional for loop, but this would also require a for loop when loading and re-assembling the matrix.
What would be the correct procedure for storing those matrices in a way that minimizes the amount of space on disk and the loading time?
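One standard option would be numpy's own .npy/.npz containers, which record shape and dtype in a small header so nothing about the matrix layout is lost (a sketch, not from the original thread):

import numpy as np

mat = (np.random.rand(512, 512) + 1j * np.random.rand(512, 512)).astype(np.complex64)

# .npy files store shape and dtype alongside the raw bytes,
# so no manual re-assembly loop is needed on load.
np.save("mat.npy", mat)
restored = np.load("mat.npy")
assert restored.shape == mat.shape and restored.dtype == mat.dtype

# Many matrices in one (optionally compressed) file:
np.savez_compressed("dataset.npz", m0=mat, m1=mat.T)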

Construct Sparse Matrix in Matlab from Compressed Sparse Column (CSC) format

I have a large sparse matrix (~5 billion non-zero values) in Python, stored in the csc_matrix format. I need to open it as a sparse matrix in Matlab. savemat apparently cannot save data of this size (it seems to be capped at ~5 GB), so I am resorting to saving it as an HDF5 file, as detailed here. However, I am having trouble opening it in Matlab.
Given these three vectors: data, indices, indptr, whose meaning is explained:
standard CSC representation where the row indices for column i
are stored in indices[indptr[i]:indptr[i+1]] and their corresponding
values are stored in data[indptr[i]:indptr[i+1]].
How can I construct this matrix in Matlab? I can open these three vectors in Matlab using h5read no problem, but I don't know how to use them to construct the sparse matrix. This is not the format of the sparse command I usually use to construct a sparse matrix.
The following code works, but is very slow. Any suggestions would be appreciated.
X = zeros(shape(1), shape(2));
for k = 1:length(indptr)-1
    i = indptr(k)+1 : indptr(k+1);   % entries belonging to column k
    y = indices(i) + 1;              % 0-based -> 1-based row indices
    X(y, k) = data(i);
end
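One way to avoid both the dense allocation and the loop (a sketch, not from the original thread): expand indptr into an explicit per-entry column vector on the Python side before saving, so the triplets match the signature of Matlab's sparse(i, j, v, m, n). The file and dataset names below are made up:

import numpy as np
import h5py
from scipy.sparse import random as sparse_random

A = sparse_random(1000, 1000, density=0.01, format="csc")  # stand-in matrix

# Column j owns entries indptr[j]:indptr[j+1], so repeating each column
# index by its entry count yields one column index per stored value.
col = np.repeat(np.arange(A.shape[1]), np.diff(A.indptr))

with h5py.File("mat_coo.h5", "w") as f:
    f["row"] = A.indices.astype(np.float64) + 1   # 1-based for Matlab
    f["col"] = col.astype(np.float64) + 1
    f["val"] = A.data
    f["shape"] = np.array(A.shape, dtype=np.float64)

On the Matlab side this then reduces to a single vectorized call, X = sparse(row, col, val, shape(1), shape(2)), with no dense intermediate.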

index million row square matrix for fast access

I have some very large matrices (say, of the order of a million rows) that I cannot keep in memory, and I would need to access subsamples of these matrices in decent time (less than a minute...).
I started looking at hdf5 and blaze in combination with numpy and pandas:
http://web.datapark.io/yves/blaze.html
http://blaze.pydata.org
But I found it a bit complicated, and I am not sure if it is the best solution.
Are there other solutions?
thanks
EDIT
Here are some more specifications about the kind of data I am dealing with:
The matrices are usually sparse (< 10% or < 25% of cells with non-zero values)
The matrices are symmetric
And what I would need to do is:
Access for reading only
Extract rectangular sub-matrices (mostly along the diagonal, but also outside)
Did you try PyTables? It can be very useful for very large matrices. Take a look at this SO post.
Your question is lacking a bit in context, but HDF5 compressed block storage is probably as efficient as a sparse storage format for the relatively dense matrices you describe. In memory, you can always cast your views to sparse matrices if it pays off. That seems like an effective and simple solution; and as far as I know there are no sparse matrix formats which can easily be read partially from disk.
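A minimal sketch of that idea with PyTables (the file name, chunk shape and compressor are illustrative choices):

import tables

N = 1_000_000
with tables.open_file("big.h5", "w") as f:
    # Chunked, compressed on-disk array: a rectangular read only
    # touches the chunks it overlaps, never the whole matrix.
    A = f.create_carray(f.root, "A",
                        atom=tables.Float32Atom(),
                        shape=(N, N),
                        chunkshape=(1024, 1024),
                        filters=tables.Filters(complevel=5, complib="blosc"))
    # ... fill A block by block here ...

with tables.open_file("big.h5", "r") as f:
    sub = f.root.A[5000:7000, 5000:7000]   # rectangular sub-matrix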

Sparse Efficiency Warning while changing the column

def tdm_modify(feature_names, tdm):
    non_useful_words = ['kill', 'stampede', 'trigger', 'cause', 'death', 'hospital',
                        'minister', 'said', 'told', 'say', 'injury', 'victim', 'report']
    indexes = [feature_names.index(word) for word in non_useful_words]
    for index in indexes:
        tdm[:, index] = 0
    return tdm
I want to manually set zero weights for some terms in the tdm matrix. With the above code I get the warning below, and I don't understand why. Is there a better way to do this?
C:\Anaconda\lib\site-packages\scipy\sparse\compressed.py:730: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
First, it is not an error. It's a warning. The next time you perform this action (in a session) it will do it without warning.
To me the message is clear:
Changing the sparsity structure of a csr_matrix is expensive.
lil_matrix is more efficient.
tdm is a csr_matrix. Given the way data is stored in that format, it takes quite a bit of extra computation to set a bunch of the elements to 0 (or, vice versa, to change them from 0). As it says, the lil_matrix format is better if you need to make this sort of change frequently.
Try some time tests on sample matrices. tdm.tolil() will convert the matrix to lil format.
I could get into how the data is stored and why changing csr is less efficient than lil, but I'd suggest reviewing the sparse formats and their respective pros and cons instead.
A simple way to think about it: csr (and csc) are designed for fast numerical calculations, especially matrix multiplication; they were developed for linear algebra problems. coo is a convenient way of defining sparse matrices. lil is a convenient way of building matrices incrementally.
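A minimal sketch of that round trip (the column indices here are made up):

import numpy as np
from scipy.sparse import csr_matrix

tdm = csr_matrix(np.arange(12.0).reshape(3, 4))

# Convert once, make all the structural changes in lil, convert back.
lil = tdm.tolil()
for index in (1, 3):          # hypothetical columns to zero out
    lil[:, index] = 0
tdm = lil.tocsr()             # no SparseEfficiencyWarning along the way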
How are you constructing tdm initially?
In scipy test files (e.g. scipy/sparse/linalg/dsolve/tests/test_linsolve.py) I find code that does
import warnings
from scipy.sparse import (spdiags, SparseEfficiencyWarning, csc_matrix,
                          csr_matrix, isspmatrix, dok_matrix, lil_matrix, bsr_matrix)

warnings.simplefilter('ignore', SparseEfficiencyWarning)
scipy/sparse/base.py
class SparseWarning(Warning):
    pass

class SparseFormatWarning(SparseWarning):
    pass

class SparseEfficiencyWarning(SparseWarning):
    pass
These warnings use the standard Python Warning class, so standard Python methods for controlling their expression apply.
I ran into this warning message as well working on a machine learning problem. The exact application was constructing a document term matrix from a corpus of text. I agree with the accepted answer. I will add one empirical observation:
My exact task was to build a 25000 x 90000 matrix of uint8.
My desired output was a sparse matrix in compressed-row format, i.e. csr_matrix.
The fastest way to do this by far, at the cost of using quite a bit more memory in the interim, was to initialize a dense matrix using np.zeros(), build it up, then do csr_matrix(dense_matrix) once at the end.
The second fastest way was to build up a lil_matrix, then convert it to csr_matrix with the .tocsr() method. This is recommended in the accepted answer. (Thank you hpaulj).
The slowest way was to assemble the csr_matrix element by element.
So to sum up: if you have enough working memory to build a dense matrix, and you only want a sparse matrix at the end for downstream efficiency, it might be faster to build the matrix in dense format and then convert it once at the end. If you need to work in sparse format the whole time because of memory limitations, building the matrix as a lil_matrix and then converting it (as in the accepted answer) is faster than building a csr_matrix from the start.
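A sketch of the fastest variant described above (the dimensions match my case; the fill step is elided):

import numpy as np
from scipy.sparse import csr_matrix

dense = np.zeros((25_000, 90_000), dtype=np.uint8)   # ~2.2 GB resident
# ... fill dense[i, j] with term counts here ...
tdm = csr_matrix(dense)   # single conversion at the end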
