How to incrementally create a sparse matrix in Python? - python

I am creating a co-occurrence matrix of integer counts, of size 1M by 1M.
After the matrix is created, the only operation I will do on it is to get the top N values per row (or per column, as it is a symmetric matrix).
I have to create the matrix as sparse to be able to fit it in memory. I read input data from a big file and update the co-occurrence of two indexes (row, col) incrementally.
The sample code for the sparse dok_matrix specifies that I should declare the size of the matrix beforehand. I know the upper bound for my matrix (1M by 1M), but in reality it might be smaller than that.
Do I have to specify the size beforehand, or can I just create it incrementally?
import numpy as np
from scipy.sparse import dok_matrix

S = dok_matrix((5, 5), dtype=np.float32)
for i in range(5):
    for j in range(5):
        S[i, j] = i + j  # Update element

An SO question from a couple of days ago, creating sparse matrix of unknown size, talks about creating a sparse matrix from data read from a file. There the OP wanted to use the lil format; I recommended building the input arrays for a coo format.
In other SO questions I've found that adding values to a plain dictionary is faster than adding them to a dok matrix - even though a dok is a dictionary subclass. There's quite a bit of overhead in the dok indexing method. In some cases, I suggested building a dict with a tuple key, and using update to add the values to a defined dok. But I suspect in your case the coo route is better.
dok and lil are the best formats for incremental construction, but neither is that great compared to python list and dict methods.
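A minimal sketch of that coo route for the co-occurrence case, accumulating counts in a plain dict keyed by (row, col) and building the matrix in one shot (the input file format here is an assumption):
import numpy as np
from collections import defaultdict
from scipy.sparse import coo_matrix

counts = defaultdict(int)           # (row, col) -> co-occurrence count
with open("pairs.txt") as f:        # hypothetical input: one "i j" pair per line
    for line in f:
        i, j = map(int, line.split())
        counts[i, j] += 1           # plain-dict accumulation is fast

rows, cols = zip(*counts.keys())
data = list(counts.values())
S = coo_matrix((data, (rows, cols)), shape=(10**6, 10**6), dtype=np.int32)
S = S.tocsr()                       # row-oriented format for the per-row work later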
As for the top N values of each row, I recall exploring that some time back, so I can't pull up a good SO question offhand. You probably want a row-oriented format such as lil or csr.
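For the top-N part, a rough sketch over a CSR matrix (my own code, using indptr to slice out each row's nonzeros):
import numpy as np

def top_n_per_row(S_csr, n):
    """Return, per row, the column indices and values of the n largest stored entries."""
    out = []
    for i in range(S_csr.shape[0]):
        start, end = S_csr.indptr[i], S_csr.indptr[i + 1]
        row_data = S_csr.data[start:end]
        row_cols = S_csr.indices[start:end]
        if len(row_data) > n:
            keep = np.argpartition(row_data, -n)[-n:]   # unordered top n
            row_data, row_cols = row_data[keep], row_cols[keep]
        order = np.argsort(row_data)[::-1]              # sort descending by value
        out.append((row_cols[order], row_data[order]))
    return out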
As for the question 'do you need to specify the size on creation?': yes. But because a sparse matrix, regardless of format, only stores nonzero values, there's little harm in creating a matrix that is too large.
I can't think of anything in a dok or coo format matrix that hinges on the shape - at least not in terms of data storage or creation. lil and csr will have some extra values. If you really need to explore this, read up on how values are stored, and play with small matrices.
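A quick experiment along those lines, assuming only a few entries ever get set:
from scipy.sparse import dok_matrix

S = dok_matrix((10**6, 10**6), dtype='float32')   # huge declared shape
S[3, 5] = 1.0
S[100_000, 42] = 2.0
print(S.nnz)   # 2 -- only the nonzeros are stored; the declared shape costs nothing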
==================
It looks like all the code for the dok format is Python in
/usr/lib/python3/dist-packages/scipy/sparse/dok.py
Scanning that file, I see that dok does have a resize method
d.resize?
Signature: d.resize(shape)
Docstring:
Resize the matrix in-place to dimensions given by 'shape'.
Any non-zero elements that lie outside the new shape are removed.
File: /usr/lib/python3/dist-packages/scipy/sparse/dok.py
Type: method
So if you want to initialize the matrix to 1M x 1M and resize to 100 x 100 you can do so - it will step through all the keys to make sure there aren't any outside the new range. So it isn't cheap, even though the main action is to change the shape parameter.
newM, newN = shape
M, N = self.shape
if newM < M or newN < N:
    # Remove all elements outside new dimensions
    for (i, j) in list(self.keys()):
        if i >= newM or j >= newN:
            del self[i, j]
self._shape = shape
If you know for sure that there aren't any keys that fall outside the new shape, you could change _shape directly. The other sparse formats don't have a resize method.
In [31]: d=sparse.dok_matrix((10,10),int)
In [32]: d
Out[32]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Dictionary Of Keys format>
In [33]: d.resize((5,5))
In [34]: d
Out[34]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Dictionary Of Keys format>
In [35]: d._shape=(9,9)
In [36]: d
Out[36]:
<9x9 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Dictionary Of Keys format>
See also:
Why are lil_matrix and dok_matrix so slow compared to common dict of dicts?
Get top-n items of every row in a scipy sparse matrix

Related

Matrix multiplication: maintain scipy.sparse.dok_matrix format

I am trying to use scipy to perform sparse linear algebra calculations in the dok (dictionary of keys) format.
When I multiply two matrices together, the format changes from dok type to csr format, which is an inefficient format for my data and subsequent operations.
How can I keep the dok format?
I have looked at the docs:
scipy sparse matrix
dok_matrix
But I cannot see any information on automatic type conversion, or if and how it can be avoided.
See this example:
from scipy.sparse import dok_matrix
my_mat = dok_matrix([[1,2], [3,4]])
print(type(my_mat.dot(my_mat)))
print(type(my_mat @ my_mat))
shows that the format has been changed:
<class 'scipy.sparse.csr.csr_matrix'>
<class 'scipy.sparse.csr.csr_matrix'>
Just convert back:
result = result.todok()
CSR may be an inefficient format for subsequent operations (or maybe not, we can't tell), but it's great for matrix multiplication. Trying to make the matrix multiplication code operate on a DOK result natively would be slower than just converting the result.
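Put together, the pattern from the question looks like this:
from scipy.sparse import dok_matrix

my_mat = dok_matrix([[1, 2], [3, 4]])
result = (my_mat @ my_mat).todok()   # multiply (internally via CSR), then convert back
print(type(result))                  # a dok_matrix again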
As pointed out by @user2357112, csr is good for linear algebra. The cost of conversion is, however, significant. As dok is not the only format that supports reasonably fast editing, it is worthwhile to check out the other option, lil. Depending on your use case you may save quite a bit of time:
from scipy.sparse import random
from timeit import timeit

a = random(100, 100, 0.1, format='lil')
b = random(100, 100, 0.1, format='dok')

a
# <100x100 sparse matrix of type '<class 'numpy.float64'>'
#   with 1000 stored elements in LInked List format>
b
# <100x100 sparse matrix of type '<class 'numpy.float64'>'
#   with 1000 stored elements in Dictionary Of Keys format>

timeit(lambda: (a @ a).tolil(), number=100) * 10
# 1.491789099527523
timeit(lambda: (b @ b).todok(), number=100) * 10
# 4.220661079743877
Note that a @ a / b @ b is rather dense in this example; if we choose a sparser case, the difference is less pronounced:
a = random(100, 100, 0.01, format='lil')
b = random(100, 100, 0.01, format='dok')

timeit(lambda: (a @ a).tolil(), number=100) * 10
# 0.6880075298249722
timeit(lambda: (b @ b).todok(), number=100) * 10
# 0.7450748200062662

Convert compressed sparse matrix to dataframe

When trying to turn a roughly (2,000,000 x 3) array of one-hot encoded values into a data frame, I encounter a 'DataFrame constructor not properly called!' error.
I've also explicitly tried wrapping the array in np.asarray() but get a 'Must pass 2-d input' error.
enc = skp.OneHotEncoder()
X_ismale = enc.fit_transform(X.IsMaleBucket.values.reshape(-1,1))
X_ismale = pd.DataFrame(X_ismale,columns=['IsMale_'+str(i) for i in np.sort(X.IsMaleBucket.unique())])
X_ismale has type:
<2256308x3 sparse matrix of type '<class 'numpy.float64'>'
with 2256308 stored elements in Compressed Sparse Row format>
Error is as previously described.
I expect an errorless conversion to dataframe but can't get it.
Pandas cannot work with sparse matrices, only with dense data. You can use to_array to convert the sparse matrix to a dense array. – jdehesa
Using to_array worked, although in the current version the method turned out to be toarray.
Thanks.
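For reference, the working conversion would look something like this (continuing the snippet from the question; toarray densifies the 2,256,308 x 3 CSR matrix, which easily fits in memory):
import numpy as np
import pandas as pd

dense = X_ismale.toarray()   # CSR -> dense ndarray
X_ismale_df = pd.DataFrame(
    dense,
    columns=['IsMale_' + str(i) for i in np.sort(X.IsMaleBucket.unique())],
)

# In newer pandas you can also keep it sparse:
# X_ismale_df = pd.DataFrame.sparse.from_spmatrix(X_ismale, columns=X_ismale_df.columns)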

Populate an empty CSR sparse matrix with columns of another csr matrix and slicing it

(Python)
Can anyone please suggest the easiest and fastest way to populate a csr matrix A with the values from the columns of another csr matrix B which is of size 400k*800k.
My failed attempt:
# x is a list of size 500 which contains the column numbers needed from B
A = sparse.csr_matrix((400000, 500))
for i in range(400000):
    for j in range(500):
        A[i, j] = B[i, x[j]]
Also, is there an easy way to split the matrix B in the ratio of 4:1?
It helps to think about the problem as if A and B were both dense arrays first. If I understand your question right, you'd want something like:
A = B[:, x]
It turns out that you can do the same operation with CSR matrices as well, and it's reasonably efficient. The key is to avoid assigning values to an existing sparse matrix (especially if it's in CSR or CSC format). By doing the indexing all at once, scipy is able to use more efficient methods.
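A rough sketch of both operations, reusing B and x from the question (the 4:1 split shown is a plain row split; shuffle first if row order matters):
import numpy as np

A = B[:, x]                        # take the 500 wanted columns in one fancy-indexing call

# split the rows of B in a 4:1 ratio
n_rows = B.shape[0]
cut = (n_rows * 4) // 5
B_big, B_small = B[:cut], B[cut:]

# or shuffle the rows first
perm = np.random.permutation(n_rows)
B_big, B_small = B[perm[:cut]], B[perm[cut:]]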

Construct sparse matrix on disk on the fly in Python

I'm currently doing some memory-intensive text processing, for which I have to construct a sparse matrix of float32s with dimensions of ~ (2M, 5M). I'm constructing this matrix column by column when reading a corpus of 5M documents. For this purpose I use a sparse dok_matrix data structure from SciPy. However, when arriving at the 500 000'th document, my memory is full (approx. 30GB is used) and the program crashes. What I eventually want to do, is perform a dimensionality reduction algorithm on the matrix using sklearn, but, as said, it is impossible to hold and construct the entire matrix in memory. I've looked into numpy.memmap, as sklearn supports this, and tried to memmap some of the underlying numpy data structures of the SciPy sparse matrix, but I could not succeed in doing this.
It is impossible for me to save the entire matrix in a dense format, since this would require 40 TB of disk space. So I think that HDF5 and PyTables are not an option for me (?).
My question is now: how can I construct a sparse matrix on the fly, writing directly to disk instead of to memory, such that I can use it afterwards in sklearn?
Thanks!
We've come across similar problems in the field of single cell genomics data dealing with large sparse datasets on disk. I'll show you a small simple example of how I would deal with this. My assumptions are that you're very memory constrained, and probably can't fit multiple copies of the sparse matrix into memory at once. This will work even if you can't fit one entire copy.
I would construct an on disk sparse CSC matrix column by column. A sparse csc matrix uses 3 underlying arrays:
data: the values stored in the matrix
indices: the row index for each value in the matrix
indptr: an array of length n_cols + 1, which divides indices and data by which column they belong to.
As an explanatory example, the values for column i are stored in the range indptr[i]:indptr[i+1] of data. Similarly, the row indices for these values can be found by indices[indptr[i]:indptr[i+1]].
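As a tiny concrete illustration of those three arrays (this example is mine, not from the original answer):
import numpy as np
from scipy import sparse

m = sparse.csc_matrix(np.array([[1, 0, 2],
                                [0, 0, 3],
                                [4, 5, 6]]))
print(m.data)     # [1 4 5 2 3 6]  values, stored column by column
print(m.indices)  # [0 2 2 0 1 2]  row index of each value
print(m.indptr)   # [0 2 3 6]      column i occupies data[indptr[i]:indptr[i+1]]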
To simulate your data generating process (parsing a document, I assume) I'll define a function process_document which returns the values for indices and data for the relevant document.
import numpy as np
import h5py
from scipy import sparse
from tqdm import tqdm  # For monitoring the writing process
from typing import Tuple, Union  # Just for argument annotation

def process_document():
    """
    Simulate processing a document. Results in a sparse vector representation.
    """
    n_items = np.random.negative_binomial(2, .0001)
    indices = np.random.choice(2_000_000, n_items, replace=False)
    indices.sort()
    data = np.random.random(n_items).astype(np.float32)
    return indices, data

def data_generator(n):
    """Iterator which yields simulated data."""
    for i in range(n):
        yield process_document()
Now I'll create a group in an HDF5 file which will store the constituent arrays of the sparse matrix.
def make_sparse_csc_group(f: Union[h5py.File, h5py.Group], groupname: str, shape: Tuple[int, int]):
    """
    Create a group in an hdf5 file that can store a CSC sparse matrix.
    """
    g = f.create_group(groupname)
    g.attrs["shape"] = shape
    g.create_dataset("indices", shape=(1,), dtype=np.int64, chunks=True, maxshape=(None,))
    g["indptr"] = np.zeros(shape[1] + 1, dtype=int)  # We want this to have a zero for the first value
    g.create_dataset("data", shape=(1,), dtype=np.float32, chunks=True, maxshape=(None,))
    return g
And finally a function for reading this group as a sparse matrix (this one is pretty simple).
def read_sparse_csc_group(g: Union[h5py.File, h5py.Group]):
    return sparse.csc_matrix((g["data"], g["indices"], g["indptr"]), shape=g.attrs["shape"])
Now we'll create the on-disk sparse matrix and write one column at a time to it (I'm using fewer columns since this can be kinda slow).
N_COLS = 10
def make_disk_matrix(f, groupname, data_iter, shape):
    group = make_sparse_csc_group(f, groupname, shape)
    indptr = group["indptr"]
    data = group["data"]
    indices = group["indices"]
    n_total = 0
    for doc_num, (cur_indices, cur_data) in enumerate(tqdm(data_iter)):
        n_cur = len(cur_indices)
        n_prev = n_total
        n_total += n_cur
        indices.resize((n_total,))
        data.resize((n_total,))
        indices[n_prev:] = cur_indices
        data[n_prev:] = cur_data
        indptr[doc_num + 1] = n_total

# Writing
with h5py.File("data.h5", "w") as f:
    make_disk_matrix(f, "mtx", data_generator(N_COLS), (2_000_000, N_COLS))

# Reading
with h5py.File("data.h5", "r") as f:
    mtx = read_sparse_csc_group(f["mtx"])
Again this is considering a very memory constrained situation, where you might not be able to fit the entire sparse matrix in memory when creating it. A much faster way to do this, if you can handle the entire sparse matrix plus at least one copy, would be to not bother with the on disk storage (similar to other suggestions). However, using a slight modification of this code should give you better performance:
def make_memory_mtx(data_iter, shape):
    indices_list = []
    data_list = []
    indptr = np.zeros(shape[1] + 1, dtype=int)
    n_total = 0
    for doc_num, (cur_indices, cur_data) in enumerate(data_iter):
        n_cur = len(cur_indices)
        n_prev = n_total
        n_total += n_cur
        indices_list.append(cur_indices)
        data_list.append(cur_data)
        indptr[doc_num + 1] = n_total
    indices = np.concatenate(indices_list)
    data = np.concatenate(data_list)
    return sparse.csc_matrix((data, indices, indptr), shape=shape)

mtx = make_memory_mtx(data_generator(10), shape=(2_000_000, 10))
This should be fairly fast, since it only makes a copy of the data when you concatenate the arrays. Other currently posted solutions reallocate the arrays as you process, making many copies of large arrays.
It would be great if you could provide a minimal working code. I can't see if your matrix gets too big by construction (1) or just because you have too much data (2). If you don't really care about building this matrix yourself, you can directly look at my remark 2.
For problem (1), in the example code below, I made a wrapper class to build a csr_matrix chunk by chunk. The idea is to accumulate (row, column, data) lists until a buffer limit (see remark 1) is reached, and only actually update the matrix at that moment. When the limit is reached, this also reduces the data in memory, since the csr_matrix constructor sums data that share the same (row, column) coordinates. This lets you construct the sparse matrix quickly (much faster than creating a sparse matrix for each row) and avoids memory errors due to the redundancy of (row, column) pairs when a word appears several times in a document.
import numpy as np
import scipy.sparse

class SparseMatrixBuilder:
    def __init__(self, shape, build_size_limit):
        self.sparse_matrix = scipy.sparse.csr_matrix(shape)
        self.shape = shape
        self.build_size_limit = build_size_limit
        self.data_temp = []
        self.col_indices_temp = []
        self.row_indices_temp = []

    def add(self, data, col_indices, row_indices):
        self.data_temp.append(data)
        self.col_indices_temp.append(col_indices)
        self.row_indices_temp.append(row_indices)
        if len(self.data_temp) == self.build_size_limit:
            # the coo-style constructor expects (data, (row_indices, col_indices))
            self.sparse_matrix += scipy.sparse.csr_matrix(
                (np.concatenate(self.data_temp),
                 (np.concatenate(self.row_indices_temp),
                  np.concatenate(self.col_indices_temp))),
                shape=self.shape
            )
            self.data_temp = []
            self.col_indices_temp = []
            self.row_indices_temp = []

    def get_matrix(self):
        if self.data_temp:  # flush any remaining buffered entries
            self.sparse_matrix += scipy.sparse.csr_matrix(
                (np.concatenate(self.data_temp),
                 (np.concatenate(self.row_indices_temp),
                  np.concatenate(self.col_indices_temp))),
                shape=self.shape
            )
            self.data_temp = []
            self.col_indices_temp = []
            self.row_indices_temp = []
        return self.sparse_matrix
For problem (2), you can easily extend this class by adding a save method that stores the matrix on disk once the limit (or a second limit) is reached. As such, you'll end up with multiple chunks of sparse matrices on disk. Then you'll need a dimensionality reduction algorithm that can handle chunked matrices (see remark 2).
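One possible shape for that save method, as a sketch only (save_npz and the chunk-file naming are my assumptions, not part of the original class): flush the buffers via get_matrix, write the chunk to disk, then reset the accumulated matrix.
import scipy.sparse

class ChunkedSparseMatrixBuilder(SparseMatrixBuilder):
    def save_chunk(self, path_prefix, chunk_id):
        """Write the matrix accumulated so far to disk and start a fresh chunk."""
        matrix = self.get_matrix()  # flushes the temporary buffers
        scipy.sparse.save_npz(f"{path_prefix}_{chunk_id:05d}.npz", matrix)
        # start with an empty matrix again for the next chunk
        self.sparse_matrix = scipy.sparse.csr_matrix(self.shape)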
remark 1: the buffer limit here is not really well defined. It would be better to check the actual size of the numpy arrays data_temp, col_indices_temp and row_indices_temp against the RAM available on the machine (which is quite easy to automate in Python).
remark 2: gensim is a Python library that has the great advantage of using chunked files for building NLP models. So you could build a dictionary, construct a sparse matrix and reduce its dimensionality with that library, without needing much RAM.
I'm assuming that all your data can fit in memory using a more memory-friendly sparse matrix format such as COO. If it does not, there is almost no hope you will be able to proceed with sklearn, even by using mmap. Indeed sklearn will likely create subsequent objects with memory requirements of the same order of magnitude as your input.
Scipy's dok_matrix is actually a subclass of the vanilla dict. It stores the data using individual Python objects and tons of pointers, so it is not memory efficient. The most compact representation is the coo_matrix format. You can incrementally build the data required to create a COO matrix by pre-allocating arrays for the coordinates (rows and cols) and the data, and growing these buffers if your initial guess was wrong.
import numpy
from scipy.sparse import coo_matrix

def get_coo_from_iter(iterable, n_data_hint=1<<20, idx_dtype='uint32', data_dtype='float32'):
    counter = 0
    rows = numpy.empty(n_data_hint, dtype=idx_dtype)
    cols = numpy.empty(n_data_hint, dtype=idx_dtype)
    data = numpy.empty(n_data_hint, dtype=data_dtype)
    for row, col, value in iterable:
        if counter >= n_data_hint:
            n_data_hint *= 2
            rows, cols, data = _reallocate(rows, cols, data, n_data_hint)
        rows[counter] = row
        cols[counter] = col
        data[counter] = value
        counter += 1
    rows = rows[:counter]
    cols = cols[:counter]
    data = data[:counter]
    return coo_matrix((data, (rows, cols)))

def _reallocate(rows, cols, data, n):
    new_rows = numpy.empty(n, dtype=rows.dtype)
    new_cols = numpy.empty(n, dtype=cols.dtype)
    new_data = numpy.empty(n, dtype=data.dtype)
    new_rows[:rows.size] = rows
    new_cols[:cols.size] = cols
    new_data[:data.size] = data
    return new_rows, new_cols, new_data
which you can test with randomly-generated data like this:
def get_random_data(n, max_row=2000, max_col=5000):
    for _ in range(n):
        row = numpy.random.choice(max_row)
        col = numpy.random.choice(max_col)
        val = numpy.random.randn()
        yield row, col, val

# test when initial hint is good
coo = get_coo_from_iter(get_random_data(10000), n_data_hint=10000)
print(coo.shape)

# or to test when initial hint was too tiny
coo = get_coo_from_iter(get_random_data(10000), n_data_hint=1111)
print(coo.shape)
Once you have your COO matrix, you may want to convert it to CSR using coo.tocsr(). CSR matrices are better optimized for common operations such as dot products. They do require a bit more memory when some rows are empty, because CSR stores pointers for all rows, even empty ones.
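For example, continuing from get_coo_from_iter above:
coo = get_coo_from_iter(get_random_data(10000))
csr = coo.tocsr()               # row-oriented: efficient row slicing and dot products
row_sums = csr.sum(axis=1)      # a typical downstream operation
first_row = csr[0].toarray()    # pull out a single row as a dense array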
Look here; at the end it explains how to store and read a sparse matrix directly to an HDF5 file.

How do you edit cells in a sparse matrix using scipy?

I'm trying to manipulate some data in a sparse matrix. Once I've created one, how do I add / alter / update values in it? This seems very basic, but I can't find it in the documentation for the sparse matrix classes, or on the web. I think I'm missing something crucial.
This is my failed attempt to do so the same way I would a normal array.
>>> from scipy.sparse import bsr_matrix
>>> A = bsr_matrix((10,10))
>>> A[5][7] = 6
Traceback (most recent call last):
File "<pyshell#11>", line 1, in <module>
A[5][7] = 6
File "C:\Python27\lib\site-packages\scipy\sparse\bsr.py", line 296, in __getitem__
raise NotImplementedError
NotImplementedError
There are several sparse matrix formats. Some are better suited to indexing. One that has implemented it is lil_matrix.
Al = A.tolil()
Al[5,7] = 6 # the normal 2d matrix indexing notation
print Al
print Al.A # aka Al.todense()
A1 = Al.tobsr() # if it must be in bsr format
The documentation for each format suggests what it is good at, and where it is bad. But it does not have a neat list of which ones have which operations defined.
Advantages of the LIL format
supports flexible slicing
changes to the matrix sparsity structure are efficient
...
Intended Usage
LIL is a convenient format for constructing sparse matrices
...
dok_matrix also implements indexing.
The underlying data structure for coo_matrix is easy to understand. It is essentially the set of parameters in the coo_matrix((data, (i, j)), [shape=(M, N)]) constructor. To create the same matrix you could use:
sparse.coo_matrix(([6],([5],[7])), shape=(10,10))
If you have more assignments, build larger data, i, j lists (or 1d arrays), and when complete construct the sparse matrix.
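For instance, a small sketch of that accumulate-then-construct pattern (the extra assignments are made up for illustration):
from scipy import sparse

data, i, j = [], [], []
for row, col, value in [(5, 7, 6), (2, 3, 1.5), (9, 0, 2)]:  # hypothetical assignments
    i.append(row)
    j.append(col)
    data.append(value)
A = sparse.coo_matrix((data, (i, j)), shape=(10, 10)).tocsr()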
The documentation for bsr is here bsr matrix and for csr is here csr matrix. It might be worth it to understand the csr before moving to the bsr. The only difference is that bsr has entries that are matrices themselves whereas the basic unit in a csr is a scalar.
I don't know if there are super easy ways to manipulate the matrices once they are created, but here are some examples of what you're trying to do,
import numpy as np
from scipy.sparse import bsr_matrix, csr_matrix
row = np.array( [5] )
col = np.array( [7] )
data = np.array( [6] )
A = csr_matrix( (data,(row,col)) )
This is a straightforward syntax in which you list all the data you want in the matrix in the array data and then specify where that data should go using row and col. Note that this will make the matrix dimensions just big enough to hold the element in the largest row and column (in this case a 6x8 matrix). You can see the matrix in standard form using the todense() method.
A.todense()
However, you cannot manipulate the matrix on the fly using this pattern. What you can do is modify the native scipy representation of the matrix. This involves 3 attributes, indices, indptr, and data. To start with, we can examine the value of these attributes for the array we've already created.
>>> print A.data
array([6])
>>> print A.indices
array([7], dtype=int32)
>>> print A.indptr
array([0, 0, 0, 0, 0, 0, 1], dtype=int32)
data is the same thing it was before: a 1-d array of values we want in the matrix. The difference is that the position of this data is now specified by indices and indptr instead of row and col. indices is fairly straightforward. It is simply a list of which column each data entry is in, and it will always be the same size as the data array. indptr is a little trickier. It lets the data structure know what row each data entry is in. To quote from the docs,
the column indices for row i are stored in indices[indptr[i]:indptr[i+1]]
From this definition we can see that the size of indptr will always be the number of rows in the matrix + 1. It takes a little while to get used to, but working through the values for each row will give you some intuition. Note that all the entries are zero until the last one. That means that the column indices for rows i=0-4 are stored in indices[0:0], i.e. the empty array; this is because these rows are all zeros. Finally, on the last row, i=5, we get indices[0:1]=7, which tells us the data entry data[0:1] is in row 5, column 7.
Now suppose we wanted to add the value 10 at row 2 column 4. We first put it into the data attribute,
A.data = np.array( [10,6] )
next we update indices to indicate the column 10 will be in,
A.indices = np.array( [4,7], dtype=np.int32 )
and finally we indicate which row it will be in by modifying indptr
A.indptr = np.array( [0,0,0,1,1,1,2], dtype=np.int32 )
It is important that you make the data type of indices and indptr np.int32. One way to visualize what's going on in indptr is that the change in numbers occurs as you move from row i to row i+1 of a row that has data. Also note that arrays like these can be used to construct sparse matrices:
B = csr_matrix( (data,indices,indptr) )
It would be nice if it was as easy as simply indexing into the array as you tried, but the implementation is not there yet. That should be enough to get you started at least.
