I'm trying to manipulate some data in a sparse matrix. Once I've created one, how do I add / alter / update values in it? This seems very basic, but I can't find it in the documentation for the sparse matrix classes, or on the web. I think I'm missing something crucial.
This is my failed attempt to do it the same way I would with a normal array.
>>> from scipy.sparse import bsr_matrix
>>> A = bsr_matrix((10,10))
>>> A[5][7] = 6
Traceback (most recent call last):
File "<pyshell#11>", line 1, in <module>
A[5][7] = 6
File "C:\Python27\lib\site-packages\scipy\sparse\bsr.py", line 296, in __getitem__
raise NotImplementedError
NotImplementedError
There are several sparse matrix formats, and some are better suited to indexing than others. One that has implemented it is lil_matrix.
Al = A.tolil()    # convert to LIL, which supports item assignment
Al[5, 7] = 6      # the normal 2d matrix indexing notation
print(Al)
print(Al.A)       # aka Al.todense()
A1 = Al.tobsr()   # if it must be in bsr format
The documentation for each format suggests what it is good at, and where it is bad. But it does not have a neat list of which ones have which operations defined.
Advantages of the LIL format
supports flexible slicing
changes to the matrix sparsity structure are efficient
...
Intended Usage
LIL is a convenient format for constructing sparse matrices
...
dok_matrix also implements indexing.
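Since dok stores entries in a dictionary keyed by (row, col), element assignment works out of the box; a minimal sketch (the values are made up for illustration):

```python
from scipy.sparse import dok_matrix

# dok (Dictionary Of Keys) supports item assignment directly,
# with no conversion step needed
D = dok_matrix((10, 10))
D[5, 7] = 6
print(D[5, 7])   # 6.0
print(D.nnz)     # 1 stored element
```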
The underlying data structure for coo_matrix is easy to understand. It essentially stores the parameters of the coo_matrix((data, (i, j)), [shape=(M, N)]) constructor. To create the same matrix you could use:
sparse.coo_matrix(([6],([5],[7])), shape=(10,10))
If you have more assignments, build larger data, i, j lists (or 1d arrays), and when complete construct the sparse matrix.
The documentation for bsr is at bsr matrix, and for csr at csr matrix. It is worth understanding csr before moving on to bsr: the only difference is that bsr has entries that are matrices themselves, whereas the basic unit in a csr is a scalar.
I don't know if there are super easy ways to manipulate the matrices once they are created, but here are some examples of what you're trying to do,
import numpy as np
from scipy.sparse import bsr_matrix, csr_matrix
row = np.array([5])
col = np.array([7])
data = np.array([6])
A = csr_matrix((data, (row, col)))
This is a straightforward syntax in which you list all the data you want in the matrix in the array data and then specify where that data should go using row and col. Note that this makes the matrix dimensions just big enough to hold the element in the largest row and column (in this case a 6x8 matrix). You can see the matrix in standard form using the todense() method.
A.todense()
However, you cannot manipulate the matrix on the fly using this pattern. What you can do is modify the native scipy representation of the matrix. This involves three attributes: indices, indptr, and data. To start with, we can examine the values of these attributes for the array we've already created.
>>> print A.data
array([6])
>>> print A.indices
array([7], dtype=int32)
>>> print A.indptr
array([0, 0, 0, 0, 0, 0, 1], dtype=int32)
data is the same thing it was before: a 1-d array of values we want in the matrix. The difference is that the position of this data is now specified by indices and indptr instead of row and col. indices is fairly straightforward. It is simply a list of which column each data entry is in, and it will always be the same size as the data array. indptr is a little trickier. It lets the data structure know what row each data entry is in. To quote from the docs,
the column indices for row i are stored in indices[indptr[i]:indptr[i+1]]
From this definition we can see that the size of indptr will always be the number of rows in the matrix + 1. It takes a little while to get used to, but working through the values for each row will give you some intuition. Note that all the entries are zero until the last one. That means the column indices for rows i=0-4 are stored in indices[0:0], i.e. the empty array, because those rows are all zeros. Finally, on the last row, i=5, we get indices[0:1] = [7], which tells us the data entry data[0:1] is in row 5, column 7.
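One way to build that intuition is to walk indptr row by row and print each row's columns and values; a small sketch using the same single-entry matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

A = csr_matrix((np.array([6]), (np.array([5]), np.array([7]))))

# row i's columns live in indices[indptr[i]:indptr[i+1]],
# and its values in the matching slice of data
for i in range(A.shape[0]):
    cols = A.indices[A.indptr[i]:A.indptr[i + 1]]
    vals = A.data[A.indptr[i]:A.indptr[i + 1]]
    print(i, cols, vals)   # rows 0-4 print empty arrays; row 5 prints [7] [6]
```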
Now suppose we wanted to add the value 10 at row 2 column 4. We first put it into the data attribute,
A.data = np.array([10, 6])
next we update indices to indicate the column the 10 will be in,
A.indices = np.array([4, 7], dtype=np.int32)
and finally we indicate which row it will be in by modifying indptr
A.indptr = np.array([0, 0, 0, 1, 1, 1, 2], dtype=np.int32)
It is important that you make the data type of indices and indptr np.int32. One way to visualize what's going on in indptr is that its value increases from indptr[i] to indptr[i+1] exactly when row i contains data. Also note that arrays like these can be used to construct sparse matrices:
B = csr_matrix((data, indices, indptr))
It would be nice if it was as easy as simply indexing into the array as you tried, but the implementation is not there yet. That should be enough to get you started at least.
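The whole walkthrough can be checked end to end by feeding the modified attribute arrays straight into the csr_matrix constructor; a sketch (the shape argument pins the dimensions to the original 6x8):

```python
import numpy as np
from scipy.sparse import csr_matrix

data = np.array([10, 6])
indices = np.array([4, 7], dtype=np.int32)
indptr = np.array([0, 0, 0, 1, 1, 1, 2], dtype=np.int32)

# indptr says row 2 holds data[0:1] and row 5 holds data[1:2]
B = csr_matrix((data, indices, indptr), shape=(6, 8))
print(B[2, 4], B[5, 7])   # 10 6
```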
Related
I'd like to preserve the order of insertion into a SciPy csr_matrix; however, it seems to always sort by row and then column index:
>>> from scipy.sparse import csr_matrix
>>> x = csr_matrix(([1,2,3],[[3,2,1],[5,2,1]]))
>>> print(x)
(1, 1) 3
(2, 2) 2
(3, 5) 1
Anyway to keep the original sorting? What I want:
(3, 5) 1
(2, 2) 2
(1, 1) 3
ETA: Figured out that inserting using the data, indices, indptr method preserves the order within a row (still sorted by row but no longer by column index), whereas inserting with data, indices where indices is a 2D index array is sorted by both row and column indices.
The CSR format stores data in a row-wise format (by marking out the places in the memory-contiguous data array where each row begins and ends). The information that you want does not exist in that format - part of the compression is to remove it.
If you need that ordering information you could leave it in COO format with the caveat that there are operations which result in COO matrices being sorted without warning. It may be best to store that information explicitly instead of implicitly (do scipy sparse matrices let you use structs in the data matrix?).
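For illustration, COO keeps row, col and data as parallel arrays in exactly the order you supply them, so the question's matrix could be rebuilt like this (with the usual caveat that some operations reorder those arrays without warning):

```python
from scipy.sparse import coo_matrix

# COO stores the triples in insertion order as parallel arrays
x = coo_matrix(([1, 2, 3], ([3, 2, 1], [5, 2, 1])))
print(x.row)    # [3 2 1]
print(x.col)    # [5 2 1]
print(x.data)   # [1 2 3]
```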
I have a sparse matrix A in Python and I want to add 14 to the first column.
A[:,0] + 14
However, I get an error message:
NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported
You can add an explicit column like this:
A[:, 0] = np.ones((A.shape[0], 1))*14 + A[:, 0]
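Since assigning into a CSR matrix changes its sparsity structure (which scipy warns about), one hedged variant is to do the column update in LIL format; a sketch on a small made-up matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

A = csr_matrix(np.eye(3))            # small stand-in matrix
Al = A.tolil()                       # LIL handles sparsity changes efficiently
Al[:, 0] = Al[:, 0].toarray() + 14   # densify the column, add the scalar
A = Al.tocsr()
print(A.toarray()[:, 0])   # [15. 14. 14.]
```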
I ran into a similar situation (as described in your question title) and after some research, I found that you can manually change the shape of your matrix but then, this doesn't look like the best solution, for this reason, I started a discussion here and my final solution was to manually create the sparse matrix (ìndices, indptr and data lists) so I could add new columns, rows and change the matrix sparsity at will.
The description of your question suggests a different problem, you don't want to add a new column, but to change the value of an element from the matrix. If this alters the matrix sparsity, I would suggest you have your own ìndices, indptr and data lists. If you want to modify a non-zero element, then you can change it directly without further problems.
Also, this might be worth reading
Can anyone please suggest the easiest and fastest way to populate a csr matrix A with the values from the columns of another csr matrix B which is of size 400k*800k.
My failed attempt:
# x is a list of size 500 which contains the column numbers needed from B
A = sparse.csr_matrix((400000, 500))
for i in range(400000):
    for j in range(500):
        A[i, j] = B[i, x[j]]
Also, is there an easy way to split the matrix B in the ratio of 4:1?
It helps to think about the problem as if A and B were both dense arrays first. If I understand your question right, you'd want something like:
A = B[:, x]
It turns out that you can do the same operation with CSR matrices as well, and it's reasonably efficient. The key is to avoid assigning values to an existing sparse matrix (especially if it's in CSR or CSC format). By doing the indexing all at once, scipy is able to use more efficient methods.
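A small sketch of that one-call indexing (with a random stand-in for the 400k x 800k matrix; scipy.sparse.random is used only to generate test data):

```python
import scipy.sparse as sparse

# hypothetical stand-ins for the large matrix B and the column list x
B = sparse.random(100, 200, density=0.05, format='csr', random_state=0)
x = [3, 10, 42]

A = B[:, x]      # one vectorized indexing call, no per-element loop
print(A.shape)   # (100, 3)
```

A 4:1 row split can be done the same way with slices, e.g. B[:80, :] and B[80:, :] for this 100-row example.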
I have read data from a file and stored into a matrix (frag_coords):
frag_coords =
[[ 916.0907976 -91.01391344 120.83596334]
[ 916.01117655 -88.73389753 146.912555 ]
[ 924.22832597 -90.51682575 120.81734705]
...
[ 972.55384732 708.71316138 52.24644577]
[ 972.49089559 710.51583744 72.86369124]]
type(frag_coords) =
class 'numpy.matrixlib.defmatrix.matrix'
I do not have any issues when reordering the matrix by a specified column. For example, the code below works just fine:
order = np.argsort(frag_coords[:,2], axis=0)
My issue is that:
len(frag_coords[0]) = 1
I need to access the individual numbers of the first row. I've tried splitting it and transforming it into a list, but everything returns the 3 numbers not as separate columns but as a single element with len=1. I need help please!
Your problem is that you're using a matrix instead of an ndarray. Are you sure you want that?
For a matrix, indexing the first row alone leads to another matrix, a row matrix. Check frag_coords[0].shape: it will be (1,3). For an ndarray, it would be (3,).
If you only need to index the first row, use two indices:
frag_coords[0,j]
Or if you store the row temporarily, just index into it as a row matrix:
tmpvar = frag_coords[0] # shape (1,3)
print(tmpvar[0,2]) # for column 2 of row 0
If you don't need too many matrix operations, I'd advise that you use np.arrays instead. You can always read your data into an array directly, but at a given point you can just transform an existing matrix with np.array(frag_coords) too if you wish.
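A minimal comparison of the two types (the data is made up):

```python
import numpy as np

m = np.matrix([[1., 2., 3.], [4., 5., 6.]])
a = np.array(m)       # same data as a plain ndarray

print(m[0].shape)     # (1, 3): indexing a matrix row stays 2-D
print(a[0].shape)     # (3,): indexing an ndarray row gives a true 1-D row
print(a[0][2])        # 3.0: the individual numbers are now reachable
```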
I am creating a co-occurring matrix, which is of size 1M by 1M integer numbers.
After the matrix is created, the only operation I will do on it is to get top N values per each row (or column. as it is a symmetric matrix).
I have to create the matrix as sparse to be able to fit it in memory. I read input data from a big file and update the co-occurrence of two indexes (row, col) incrementally.
The sample code for the sparse dok_matrix specifies that I should declare the size of the matrix beforehand. I know the upper bound for my matrix (1M by 1M), but in reality it might have fewer than that.
Do I have to specify the size beforehand, or can I just create it incrementally?
import numpy as np
from scipy.sparse import dok_matrix
S = dok_matrix((5, 5), dtype=np.float32)
for i in range(5):
    for j in range(5):
        S[i, j] = i + j  # Update element
A SO question from a couple of days ago, creating sparse matrix of unknown size, talks about creating a sparse matrix from data read from a file. There the OP wanted to use lil format; I recommended building the input arrays for a coo format.
In other SO questions I've found that adding values to a plain dictionary is faster than adding them to a dok matrix - even though a dok is a dictionary subclass. There's quite a bit of overhead in the dok indexing method. In some cases, I suggested building a dict with a tuple key, and using update to add the values to a defined dok. But I suspect in your case the coo route is better.
dok and lil are the best formats for incremental construction, but neither is that great compared to python list and dict methods.
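A sketch of that dict-then-convert route (the pair list stands in for data read from the file, and the shape for the 1M x 1M upper bound):

```python
from scipy.sparse import coo_matrix

# accumulate co-occurrence counts in a plain dict keyed by (row, col)
counts = {}
pairs = [(0, 1), (0, 1), (2, 3)]   # stand-in for pairs read from the big file
for i, j in pairs:
    counts[(i, j)] = counts.get((i, j), 0) + 1

# one conversion at the end; shape can be the known upper bound
rows, cols = zip(*counts)
M = coo_matrix((list(counts.values()), (rows, cols)), shape=(1000, 1000)).tocsr()
print(M[0, 1], M[2, 3])   # 2 1
```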
As for the top N values of each row, I recall exploring that, but back some time, so can't pull up a good SO question offhand. You probably want a row oriented format such as lil or csr.
As for the question 'do you need to specify the size on creation?': yes. But because a sparse matrix, regardless of format, only stores nonzero values, there's little harm in creating a matrix that is too large.
I can't think of anything in a dok or coo format matrix that hinges on the shape - at least not in terms of data storage or creation. lil and csr will have some extra values. If you really need to explore this, read up on how values are stored, and play with small matrices.
==================
It looks like all the code for the dok format is Python in
/usr/lib/python3/dist-packages/scipy/sparse/dok.py
Scanning that file, I see that dok does have a resize method
d.resize?
Signature: d.resize(shape)
Docstring:
Resize the matrix in-place to dimensions given by 'shape'.
Any non-zero elements that lie outside the new shape are removed.
File: /usr/lib/python3/dist-packages/scipy/sparse/dok.py
Type: method
So if you want to initialize the matrix to 1M x 1M and resize to 100 x 100 you can do so - it will step through all the keys to make sure there aren't any outside the new range. So it isn't cheap, even though the main action is to change the shape parameter.
newM, newN = shape
M, N = self.shape
if newM < M or newN < N:
    # Remove all elements outside new dimensions
    for (i, j) in list(self.keys()):
        if i >= newM or j >= newN:
            del self[i, j]
self._shape = shape
If you know for sure that there aren't any keys that fall outside the new shape, you could change _shape directly. The other sparse formats don't have a resize method.
In [31]: d=sparse.dok_matrix((10,10),int)
In [32]: d
Out[32]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Dictionary Of Keys format>
In [33]: d.resize((5,5))
In [34]: d
Out[34]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Dictionary Of Keys format>
In [35]: d._shape=(9,9)
In [36]: d
Out[36]:
<9x9 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Dictionary Of Keys format>
See also:
Why are lil_matrix and dok_matrix so slow compared to common dict of dicts?
Get top-n items of every row in a scipy sparse matrix