(This question relates to "Populate a Pandas SparseDataFrame from a SciPy Sparse Matrix". I specifically want to populate a SparseDataFrame from a scipy.sparse.coo_matrix; the linked question covers a different SciPy sparse format (csr). So here goes...)
I noticed Pandas now has support for Sparse Matrices and Arrays. Currently, I create DataFrame()s like this:
return DataFrame(matrix.toarray(), columns=features, index=observations)
Is there a way to create a SparseDataFrame() with a scipy.sparse.coo_matrix() or coo_matrix()? Converting to dense format kills RAM badly...!
http://pandas.pydata.org/pandas-docs/stable/sparse.html#interaction-with-scipy-sparse
A convenience method SparseSeries.from_coo() is implemented for creating a SparseSeries from a scipy.sparse.coo_matrix.
Within scipy.sparse there are methods that convert the formats to each other: .tocoo, .tocsc, etc. So you can use whichever format is best for a particular operation.
For going the other way, I've answered
Pandas sparse dataFrame to sparse matrix, without generating a dense matrix in memory
Your linked answer from 2013 iterates by row - using toarray to make the row dense. I haven't looked at what the pandas from_coo does.
A more recent SO question on pandas sparse
non-NDFFrame object error using pandas.SparseSeries.from_coo() function
From https://github.com/pydata/pandas/blob/master/pandas/sparse/scipy_sparse.py
def _coo_to_sparse_series(A, dense_index=False):
    """ Convert a scipy.sparse.coo_matrix to a SparseSeries.
    Use the defaults given in the SparseSeries constructor. """
    s = Series(A.data, MultiIndex.from_arrays((A.row, A.col)))
    s = s.sort_index()
    s = s.to_sparse()  # TODO: specify kind?
    # ...
    return s
In effect it takes the same data, i, j used to build a coo matrix, makes a series, sorts it, and turns it into a sparse series.
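For a DataFrame rather than a Series, here is a minimal sketch. It assumes a newer pandas (0.25 or later, as far as I know), where DataFrame.sparse.from_spmatrix is available. The small matrix and the features/observations labels are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Tiny stand-in for the real coo_matrix (values and positions are made up).
row = np.array([0, 1, 2])
col = np.array([0, 2, 3])
vals = np.array([1.0, 2.0, 3.0])
matrix = sparse.coo_matrix((vals, (row, col)), shape=(3, 4))

features = ["f0", "f1", "f2", "f3"]     # hypothetical column labels
observations = ["a", "b", "c"]          # hypothetical row labels

# Builds sparse columns directly; matrix.toarray() is never called.
df = pd.DataFrame.sparse.from_spmatrix(matrix, index=observations, columns=features)
print(df.dtypes)
```

Each column of df is backed by a sparse array, so the dense matrix is not materialized during construction.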
So I want to merge 2 datasets, 1 is a single band raster dataset that came from rioxarray.open_rasterio(), the other a lookup table, with an index dim 'mukey'. The coords along 'mukey' correspond to 'mukey' index values in the lookup table. The desired result is a dataset with identical x and y coords to the Raster dataset, with variables 'n' and 'K' whose values are populated by merging on the 'mukey'. If you are familiar with ArcGIS, this is the analogous operation.
xr.merge() and assign() don't quite seem to perform this operation, and cheating by converting into pandas or numpy hits memory issues on my 32GB machine. Does xarray provide any support for this simple use case? Thanks,
import numpy as np
import pandas as pd
import xarray as xr
data = (np.abs(np.random.randn(12000000))).astype(np.int32).reshape(1,4000,3000)
raster = xr.DataArray(data,dims=['band','y','x'],coords=[[1],np.arange(4000),np.arange(3000)])
raster = raster.to_dataset(name='mukey')
raster
lookup = pd.DataFrame({'mukey':list(range(10)),'n':np.random.randn(10),'K':np.random.randn(10)*2}).set_index('mukey').to_xarray()
lookup
You're looking for the advanced indexing with DataArrays feature of xarray.
You can provide a DataArray as a keyword argument to DataArray.sel or Dataset.sel - this will reshape the indexed array along the dimensions of the indexing array, based on the values of the indexing array. I think this is exactly what you're looking for in a "lookup table".
In your case:
lookup.sel(mukey=raster.mukey)
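A scaled-down sketch of the same lookup, assuming a made-up 2x2 raster and a three-row lookup table so the result is easy to inspect by eye:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up 1 x 2 x 2 raster of mukey values.
data = np.array([[[0, 1], [2, 1]]], dtype=np.int32)
raster = xr.DataArray(
    data, dims=["band", "y", "x"], coords=[[1], np.arange(2), np.arange(2)]
).to_dataset(name="mukey")

# Made-up lookup table indexed by mukey.
lookup = (
    pd.DataFrame({"mukey": [0, 1, 2], "n": [10.0, 11.0, 12.0], "K": [0.0, 0.5, 1.0]})
    .set_index("mukey")
    .to_xarray()
)

# Vectorized indexing: every raster cell picks the lookup row for its mukey.
merged = lookup.sel(mukey=raster.mukey)
print(merged["n"].values)
```

The variables n and K in merged take on the band/y/x dimensions of the raster.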
I am trying to use scipy to perform sparse linear algebra calculations in the dok (dictionary of keys) format.
When I multiply two matrices together, the format changes from dok to csr, which is an inefficient format for the data and subsequent operations.
How can I keep the dok format?
I have looked at the docs:
scipy sparse matrix
dok_matrix
But I cannot see any information about automatic type conversion, or whether and how it can be avoided.
See this example:
from scipy.sparse import dok_matrix
my_mat = dok_matrix([[1,2], [3,4]])
print(type(my_mat.dot(my_mat)))
print(type(my_mat @ my_mat))
shows that the format has been changed:
<class 'scipy.sparse.csr.csr_matrix'>
<class 'scipy.sparse.csr.csr_matrix'>
Just convert back:
result = result.todok()
CSR may be an inefficient format for subsequent operations (or maybe not, we can't tell), but it's great for matrix multiplication. Trying to make the matrix multiplication code operate on a DOK result natively would be slower than just converting the result.
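Put together with the example from the question, a minimal sketch of the round trip looks like this (the variable names product and result are just for illustration):

```python
import numpy as np
from scipy.sparse import dok_matrix

my_mat = dok_matrix(np.array([[1, 2], [3, 4]]))

product = my_mat @ my_mat    # the product is not returned in dok format
result = product.todok()     # convert back explicitly

print(type(product).__name__, "->", type(result).__name__)
```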
As pointed out by @user2357112, csr is good for linear algebra. The cost of conversion is, however, significant. Since dok is not the only format that supports editing in acceptable time, it is worthwhile to check out the other option, which is lil. Depending on your use case you may save quite a bit of time:
from scipy.sparse import random
from timeit import timeit
a = random(100,100,0.1,format='lil')
b = random(100,100,0.1,format='dok')
a
# <100x100 sparse matrix of type '<class 'numpy.float64'>'
# with 1000 stored elements in LInked List format>
b
# <100x100 sparse matrix of type '<class 'numpy.float64'>'
# with 1000 stored elements in Dictionary Of Keys format>
timeit(lambda:(a@a).tolil(),number=100)*10
# 1.491789099527523
timeit(lambda:(b@b).todok(),number=100)*10
# 4.220661079743877
Note that a@a / b@b is rather dense in this example; if we choose a sparser case, the difference is less pronounced:
a = random(100,100,0.01,format='lil')
b = random(100,100,0.01,format='dok')
timeit(lambda:(a@a).tolil(),number=100)*10
# 0.6880075298249722
timeit(lambda:(b@b).todok(),number=100)*10
# 0.7450748200062662
(Python)
Can anyone please suggest the easiest and fastest way to populate a csr matrix A with the values from the columns of another csr matrix B which is of size 400k*800k.
My failed attempt:
#x is a list of size 500 which contains the column numbers needed from B
A=sparse.csr_matrix((400000,500))
for i in range(400000):
    for j in range(500):
        A[i,j]=B[i,x[j]]
Also, is there an easy way to split the matrix B in the ratio of 4:1?
It helps to think about the problem as if A and B were both dense arrays first. If I understand your question right, you'd want something like:
A = B[:, x]
It turns out that you can do the same operation with CSR matrices as well, and it's reasonably efficient. The key is to avoid assigning values to an existing sparse matrix (especially if it's in CSR or CSC format). By doing the indexing all at once, scipy is able to use more efficient methods.
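A small sketch of that indexing, using a made-up 20x50 random matrix in place of the real 400k x 800k one. The 4:1 split at the end assumes the split is meant by rows, which the question does not state.

```python
import numpy as np
from scipy import sparse

# Small stand-in for B (the sizes and density here are made up).
B = sparse.random(20, 50, density=0.2, format="csr", random_state=0)
x = [3, 7, 42, 0, 19]        # column numbers wanted from B

A = B[:, x]                  # one indexing call, no element-wise assignment

# Assumption: "split in the ratio 4:1" means the first 4/5 of rows vs. the rest.
n_first = B.shape[0] * 4 // 5
B_first, B_rest = B[:n_first], B[n_first:]

print(A.shape, B_first.shape, B_rest.shape)
```

Row slicing is cheap on CSR. If the split were by columns instead, converting to CSC first would probably be the better fit.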
I am creating a co-occurrence matrix, which is a 1M by 1M matrix of integers.
After the matrix is created, the only operation I will do on it is to get the top N values for each row (or column, as it is a symmetric matrix).
I have to create the matrix as sparse to be able to fit it in memory. I read input data from a big file, and update the co-occurrence of two indexes (row, col) incrementally.
The sample code for the sparse dok_matrix specifies that I should declare the size of the matrix beforehand. I know the upper bound for my matrix (1M by 1M), but in reality it might be smaller than that.
Do I have to specify the size beforehand, or can I just create it incrementally?
import numpy as np
from scipy.sparse import dok_matrix
S = dok_matrix((5, 5), dtype=np.float32)
for i in range(5):
    for j in range(5):
        S[i, j] = i + j # Update element
A SO question from a couple of days ago, creating sparse matrix of unknown size, talks about creating a sparse matrix from data read from a file. There the OP wanted to use lil format; I recommended building the input arrays for a coo format.
In other SO questions I've found that adding values to a plain dictionary is faster than adding them to a dok matrix - even though a dok is a dictionary subclass. There's quite a bit of overhead in the dok indexing method. In some cases, I suggested building a dict with a tuple key, and using update to add the values to a defined dok. But I suspect in your case the coo route is better.
dok and lil are the best formats for incremental construction, but neither is that great compared to python list and dict methods.
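A minimal sketch of the dict-then-coo route, assuming the file yields (row, col) pairs. The pairs list below is a made-up stand-in for that input.

```python
from collections import Counter
from scipy import sparse

# Hypothetical stream of (row, col) pairs, standing in for the big input file.
pairs = [(0, 1), (1, 0), (0, 1), (2, 3), (3, 2), (0, 1)]

# Count co-occurrences with plain dict updates (no sparse indexing overhead).
counts = Counter(pairs)

rows, cols = zip(*counts.keys())
vals = list(counts.values())

# The shape is only needed here, after all the input has been seen.
n = max(max(rows), max(cols)) + 1
S = sparse.coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()
print(S.toarray())
```

Because the matrix is built once at the end, the size does not need to be guessed up front.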
As for the top N values of each row, I recall exploring that, but it was some time ago, so I can't pull up a good SO question offhand. You probably want a row oriented format such as lil or csr.
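One possible way to do the top N step with csr, as a sketch rather than a tuned solution. The helper name top_n_per_row is made up, and it only looks at stored (nonzero) values.

```python
import numpy as np
from scipy import sparse

def top_n_per_row(mat, n):
    """For each row, return up to n (column, value) pairs with the largest stored values."""
    mat = sparse.csr_matrix(mat)
    top = []
    for i in range(mat.shape[0]):
        # Row i occupies the slice indptr[i]:indptr[i+1] of data and indices.
        start, end = mat.indptr[i], mat.indptr[i + 1]
        vals = mat.data[start:end]
        cols = mat.indices[start:end]
        order = np.argsort(vals)[::-1][:n]      # positions of the largest values first
        top.append([(int(cols[k]), vals[k].item()) for k in order])
    return top

dense = np.array([[0, 5, 1, 3],
                  [0, 0, 0, 0],
                  [9, 0, 2, 0]])
print(top_n_per_row(dense, 2))
# [[(1, 5), (3, 3)], [], [(0, 9), (2, 2)]]
```

The Python loop over rows will be slow for a million rows, but each step only touches that row's stored entries.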
As for the question - 'do you need to specify the size on creation'. Yes. Because a sparse matrix, regardless of format, only stores nonzero values, there's little harm in creating a matrix that is too large.
I can't think of anything in a dok or coo format matrix that hinges on the shape - at least not in terms of data storage or creation. lil and csr will have some extra values. If you really need to explore this, read up on how values are stored, and play with small matrices.
==================
It looks like all the code for the dok format is Python in
/usr/lib/python3/dist-packages/scipy/sparse/dok.py
Scanning that file, I see that dok does have a resize method
d.resize?
Signature: d.resize(shape)
Docstring:
Resize the matrix in-place to dimensions given by 'shape'.
Any non-zero elements that lie outside the new shape are removed.
File: /usr/lib/python3/dist-packages/scipy/sparse/dok.py
Type: method
So if you want to initialize the matrix to 1M x 1M and resize to 100 x 100 you can do so - it will step through all the keys to make sure there aren't any outside the new range. So it isn't cheap, even though the main action is to change the shape parameter.
newM, newN = shape
M, N = self.shape
if newM < M or newN < N:
    # Remove all elements outside new dimensions
    for (i, j) in list(self.keys()):
        if i >= newM or j >= newN:
            del self[i, j]
self._shape = shape
If you know for sure that there aren't any keys that fall outside the new shape, you could change _shape directly. The other sparse formats don't have a resize method.
In [31]: d=sparse.dok_matrix((10,10),int)
In [32]: d
Out[32]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Dictionary Of Keys format>
In [33]: d.resize((5,5))
In [34]: d
Out[34]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Dictionary Of Keys format>
In [35]: d._shape=(9,9)
In [36]: d
Out[36]:
<9x9 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Dictionary Of Keys format>
See also:
Why are lil_matrix and dok_matrix so slow compared to common dict of dicts?
Get top-n items of every row in a scipy sparse matrix
I'm trying to manipulate some data in a sparse matrix. Once I've created one, how do I add / alter / update values in it? This seems very basic, but I can't find it in the documentation for the sparse matrix classes, or on the web. I think I'm missing something crucial.
This is my failed attempt to do so the same way I would a normal array.
>>> from scipy.sparse import bsr_matrix
>>> A = bsr_matrix((10,10))
>>> A[5][7] = 6
Traceback (most recent call last):
File "<pyshell#11>", line 1, in <module>
A[5][7] = 6
File "C:\Python27\lib\site-packages\scipy\sparse\bsr.py", line 296, in __getitem__
raise NotImplementedError
NotImplementedError
There are several sparse matrix formats. Some are better suited to indexing. One that has implemented it is lil_matrix.
Al = A.tolil()
Al[5,7] = 6 # the normal 2d matrix indexing notation
print(Al)
print(Al.A) # aka Al.toarray()
A1 = Al.tobsr() # if it must be in bsr format
The documentation for each format suggests what it is good at, and where it is bad. But it does not have a neat list of which ones have which operations defined.
Advantages of the LIL format
supports flexible slicing
changes to the matrix sparsity structure are efficient
...
Intended Usage
LIL is a convenient format for constructing sparse matrices
...
dok_matrix also implements indexing.
The underlying data structure for coo_matrix is easy to understand. It is essentially the parameters for coo_matrix((data, (i, j)), [shape=(M, N)]) definition. To create the same matrix you could use:
sparse.coo_matrix(([6],([5],[7])), shape=(10,10))
If you have more assignments, build larger data, i, j lists (or 1d arrays), and when complete construct the sparse matrix.
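A sketch of that pattern, with made-up (value, row, col) triples. The explicit blocksize=(2, 2) is just one valid choice for a 10x10 matrix.

```python
from scipy import sparse

# Collect the assignments as parallel lists; the triples below are made up.
data, i, j = [], [], []
for value, r, c in [(6, 5, 7), (10, 2, 4), (1, 0, 0)]:
    data.append(value)
    i.append(r)
    j.append(c)

# Construct once, when all values are known.
M = sparse.coo_matrix((data, (i, j)), shape=(10, 10))

# Convert at the end, if the result must be in bsr format.
A = M.tobsr(blocksize=(2, 2))
print(A.toarray()[5, 7])
```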
The documentation for bsr is under bsr_matrix and the documentation for csr is under csr_matrix. It might be worth understanding csr before moving to bsr. The only difference is that the entries of a bsr are themselves matrices (blocks), whereas the basic unit in a csr is a scalar.
I don't know if there are super easy ways to manipulate the matrices once they are created, but here are some examples of what you're trying to do,
import numpy as np
from scipy.sparse import bsr_matrix, csr_matrix
row = np.array( [5] )
col = np.array( [7] )
data = np.array( [6] )
A = csr_matrix( (data,(row,col)) )
This is a straightforward syntax in which you list all the data you want in the matrix in the array data and then specify where that data should go using row and col. Note that this will make the matrix dimensions just big enough to hold the element in the largest row and column ( in this case a 6x8 matrix ). You can see the matrix in standard form using the todense() method.
A.todense()
However, you cannot manipulate the matrix on the fly using this pattern. What you can do is modify the native scipy representation of the matrix. This involves 3 attributes, indices, indptr, and data. To start with, we can examine the value of these attributes for the array we've already created.
>>> print A.data
array([6])
>>> print A.indices
array([7], dtype=int32)
>>> print A.indptr
array([0, 0, 0, 0, 0, 0, 1], dtype=int32)
data is the same thing it was before, a 1-d array of values we want in the matrix. The difference is that the position of this data is now specified by indices and indptr instead of row and col. indices is fairly straightforward. It is simply a list of which column each data entry is in. It will always be the same size as the data array. indptr is a little trickier. It lets the data structure know what row each data entry is in. To quote from the docs,
the column indices for row i are stored in indices[indptr[i]:indptr[i+1]]
From this definition we can see that the size of indptr will always be the number of rows in the matrix + 1. It takes a little while to get used to it, but working through the values for each row will give you some intuition. Note that all the entries are zero until the last one. That means that the column indices for rows i=0-4 are going to be stored in indices[0:0] i.e. the empty array. This is because these rows are all zeros. Finally, on the last row, i=5 we get indices[0:1]=7 which tells us the data entry(ies) data[0:1] are in row 5, column 7.
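The row-by-row walk described above can be sketched as a short loop over indptr. The entries list is only there to collect what the loop finds.

```python
import numpy as np
from scipy.sparse import csr_matrix

row = np.array([5])
col = np.array([7])
data = np.array([6])
A = csr_matrix((data, (row, col)))

entries = []
for i in range(A.shape[0]):
    # Column indices and values for row i live in this slice.
    start, end = A.indptr[i], A.indptr[i + 1]
    for j, v in zip(A.indices[start:end], A.data[start:end]):
        entries.append((i, int(j), int(v)))

print(entries)
# [(5, 7, 6)]
```

Rows 0 to 4 contribute nothing because their slices are empty.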
Now suppose we wanted to add the value 10 at row 2 column 4. We first put it into the data attribute,
A.data = np.array( [10,6] )
next we update indices to indicate the column 10 will be in,
A.indices = np.array( [4,7], dtype=np.int32 )
and finally we indicate which row it will be in by modifying indptr
A.indptr = np.array( [0,0,0,1,1,1,2], dtype=np.int32 )
It is important that you make the data type of indices and indptr np.int32. One way to visualize what's going on in indptr is that the change in numbers occurs as you move from i to i+1 of a row that has data. Also note that arrays like these can be used to construct sparse matrices
B = csr_matrix( (data,indices,indptr) )
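As a self-contained check of the arrays worked out above, here is a sketch that builds the matrix from them and looks at the dense form. The explicit shape=(6, 8) is added here so the number of columns is not inferred from the indices.

```python
import numpy as np
from scipy.sparse import csr_matrix

data = np.array([10, 6])
indices = np.array([4, 7], dtype=np.int32)              # column of each value
indptr = np.array([0, 0, 0, 1, 1, 1, 2], dtype=np.int32)  # row boundaries

B = csr_matrix((data, indices, indptr), shape=(6, 8))
dense = B.toarray()
print(dense[2, 4], dense[5, 7])
```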
It would be nice if it was as easy as simply indexing into the array as you tried, but the implementation is not there yet. That should be enough to get you started at least.