[Short version]
Is there an equivalent to numpy.diagflat() in scipy.sparse? Or any way to 'flatten' a sparse matrix made dense?
[Long version]
I have a sparse matrix (mathematically a vector), x_f, that I need to diagonalise (i.e. create a square matrix with the values of the x_f vector on the diagonal).
x_f
Out[59]:
<35021x1 sparse matrix of type '<class 'numpy.float64'>'
with 47 stored elements in Compressed Sparse Row format>
I've tried 'diags' from the scipy.sparse module. (I've also tried 'spdiags', but it's just a fancier version of 'diags', which I don't need.) I've tried every combination of [csr or csc format], [original or transposed vector] and [.todense() or .toarray()], but I keep getting the error:
ValueError: Different number of diagonals and offsets.
With sparse.diags the default offset is 0, i.e. the main diagonal, which is exactly where I want the numbers to go, so getting this error means it isn't working as I want it to.
Here are examples of the original and transposed vector with .todense() and .toarray() respectively:
x_f_original.todense()
Out[72]:
matrix([[ 0.00000000e+00],
[ 0.00000000e+00],
[ 0.00000000e+00],
...,
[ 0.00000000e+00],
[ 1.03332178e-17],
[ 0.00000000e+00]])
x_f_transposed.toarray()
Out[83]:
array([[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
0.00000000e+00, 1.03332178e-17, 0.00000000e+00]])
The following code works, but takes about 15 seconds to run:
x_f_diag = sparse.csc_matrix(np.diagflat(x_f.todense()))
Does anyone have any ideas of how to make it more efficient or just a better way to do this?
[Disclaimer]
This is my first question here. I hope I did it right and apologise for anything that's unclear.
In [106]: x_f = sparse.random(1000,1, .1, 'csr')
In [107]: x_f
Out[107]:
<1000x1 sparse matrix of type '<class 'numpy.float64'>'
with 100 stored elements in Compressed Sparse Row format>
I can use it in sparse.diags if I turn it into a 1d dense array.
In [108]: M1=sparse.diags(x_f.A.ravel()).tocsr()
In [109]: M1
Out[109]:
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
with 100 stored elements in Compressed Sparse Row format>
Or I can make it a (1,1000) matrix, and use a list as the offset:
In [110]: M2=sparse.diags(x_f.T.A,[0]).tocsr()
In [111]: M2
Out[111]:
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
with 100 stored elements in Compressed Sparse Row format>
diags takes a dense diagonal, not a sparse one. It is stored as given, so I have to use the further .tocsr() to remove the 0s etc.
In [113]: sparse.diags(x_f.T.A,[0])
Out[113]:
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
with 1000 stored elements (1 diagonals) in DIAgonal format>
So either way I am matching the shape of the diagonal input with the number of offsets (a scalar, or a length-1 list).
A direct mapping to csr (or csc) is probably faster.
With this column shape, the indices attribute doesn't tell us anything.
In [125]: x_f.indices
Out[125]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...0, 0, 0], dtype=int32)
But transform it to csc (which maps the indptr onto indices):
In [126]: x_f.tocsc().indices
Out[126]:
array([ 2, 15, 26, 32, 47, 56, 75, 82, 96, 99, 126, 133, 136,
141, 145, 149, ... 960, 976], dtype=int32)
In [127]: idx=x_f.tocsc().indices
In [128]: M3 = sparse.csr_matrix((x_f.data, (idx, idx)),(1000,1000))
In [129]: M3
Out[129]:
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
with 100 stored elements in Compressed Sparse Row format>
You can use the following constructor:
csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])
where data, row_ind and col_ind satisfy the relationship
a[row_ind[k], col_ind[k]] = data[k].
Demo (COO Matrix):
import numpy as np
from scipy.sparse import random, csr_matrix, coo_matrix
In [142]: M = random(10000, 1, .005, 'coo')
In [143]: M
Out[143]:
<10000x1 sparse matrix of type '<class 'numpy.float64'>'
with 50 stored elements in COOrdinate format>
In [144]: M2 = coo_matrix((M.data, np.diag_indices(len(M.data))), (len(M.data), len(M.data)))
In [145]: M2
Out[145]:
<50x50 sparse matrix of type '<class 'numpy.float64'>'
with 50 stored elements in COOrdinate format>
In [146]: M2.todense()
Out[146]:
matrix([[ 0.1559936 , 0. , 0. , ..., 0. , 0. , 0. ],
[ 0. , 0.28984266, 0. , ..., 0. , 0. , 0. ],
[ 0. , 0. , 0.21381431, ..., 0. , 0. , 0. ],
...,
[ 0. , 0. , 0. , ..., 0.23100531, 0. , 0. ],
[ 0. , 0. , 0. , ..., 0. , 0.13789309, 0. ],
[ 0. , 0. , 0. , ..., 0. , 0. , 0.73827 ]])
Demo (CSR matrix):
In [112]: from scipy.sparse import random, csr_matrix
In [113]: M = random(10000, 1, .005, 'csr')
In [114]: M
Out[114]:
<10000x1 sparse matrix of type '<class 'numpy.float64'>'
with 50 stored elements in Compressed Sparse Row format>
In [137]: M2 = csr_matrix((M.data, np.diag_indices(len(M.data))), (len(M.data), len(M.data)))
In [138]: M2
Out[138]:
<50x50 sparse matrix of type '<class 'numpy.float64'>'
with 50 stored elements in Compressed Sparse Row format>
In [139]: M2.todense()
Out[139]:
matrix([[ 0.45661992, 0. , 0. , ..., 0. , 0. , 0. ],
[ 0. , 0.42428401, 0. , ..., 0. , 0. , 0. ],
[ 0. , 0. , 0.99484544, ..., 0. , 0. , 0. ],
...,
[ 0. , 0. , 0. , ..., 0.80880579, 0. , 0. ],
[ 0. , 0. , 0. , ..., 0. , 0.46292628, 0. ],
[ 0. , 0. , 0. , ..., 0. , 0. , 0.56363196]])
If you need a dense matrix:
In [147]: np.diagflat(M.data)
Out[147]:
array([[ 0.1559936 , 0. , 0. , ..., 0. , 0. , 0. ],
[ 0. , 0.28984266, 0. , ..., 0. , 0. , 0. ],
[ 0. , 0. , 0.21381431, ..., 0. , 0. , 0. ],
...,
[ 0. , 0. , 0. , ..., 0.23100531, 0. , 0. ],
[ 0. , 0. , 0. , ..., 0. , 0.13789309, 0. ],
[ 0. , 0. , 0. , ..., 0. , 0. , 0.73827 ]])
I want to add a value to each non-zero element in my sparse matrix. Can someone give me a method to do that?
y=sparse.csc_matrix((df[column_name].values,(df['user_id'].values, df['anime_id'].values)),shape=(rows, cols))
x=np.random.laplace(0,scale)
y=y+x
The above code is giving me an error.
Offered without comment:
In [166]: from scipy import sparse
In [167]: M = sparse.random(5,5,.2,'csc')
In [168]: M
Out[168]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 5 stored elements in Compressed Sparse Column format>
In [169]: M.A
Out[169]:
array([[0.24975586, 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0.6863175 , 0. ],
[0.43488131, 0.19245474, 0.26190903, 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ]])
In [171]: x=np.random.laplace(0,10)
In [172]: x
Out[172]: 0.4773577605565098
In [173]: M+x
Traceback (most recent call last):
Input In [173] in <cell line: 1>
M+x
File /usr/local/lib/python3.8/dist-packages/scipy/sparse/_base.py:464 in __add__
raise NotImplementedError('adding a nonzero scalar to a '
NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported
This is the error message you should have shown initially.
In [174]: M.data += x
In [175]: M.A
Out[175]:
array([[0.72711362, 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 1.16367526, 0. ],
[0.91223907, 0.6698125 , 0.73926679, 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ]])
The question sounds very basic. But when I try to use where or boolean conditions on numpy arrays, it always returns a flattened array.
I have the NumPy array
P = array([[ 0.49530662, 0.07901 , -0.19012371],
[ 0.1421513 , 0.48607405, -0.20315014],
[ 0.76467375, 0.16479826, -0.56598029],
[ 0.53530718, -0.21166188, -0.08773241]])
I want to extract the array of only negative values, but when I try
P[P<0]
array([-0.19012371, -0.20315014, -0.56598029, -0.21166188, -0.08773241])
P[np.where(P<0)]
array([-0.19012371, -0.20315014, -0.56598029, -0.21166188, -0.08773241])
I get a flattened array. How can I extract the array of the form
array([[ 0, 0, -0.19012371],
[ 0 , 0, -0.20315014],
[ 0, 0, -0.56598029],
[ 0, -0.21166188, -0.08773241]])
I do not wish to create a temp array and then use something like Temp[Temp>=0] = 0
Since your need is:
I want to "extract" the array of only negative values
You can use numpy.where() with your condition (checking for negative values), which preserves the shape of the array, as in the example below:
In [61]: np.where(P<0, P, 0)
Out[61]:
array([[ 0. , 0. , -0.19012371],
[ 0. , 0. , -0.20315014],
[ 0. , 0. , -0.56598029],
[ 0. , -0.21166188, -0.08773241]])
where P is your input array.
Another idea could be to use numpy.zeros_like() for initializing a same shape array and numpy.where() to gather the indices at which our condition satisfies.
# initialize our result array with zeros
In [106]: non_positives = np.zeros_like(P)
# gather the indices where our condition is obeyed
In [107]: idxs = np.where(P < 0)
# copy the negative values to correct indices
In [108]: non_positives[idxs] = P[idxs]
In [109]: non_positives
Out[109]:
array([[ 0. , 0. , -0.19012371],
[ 0. , 0. , -0.20315014],
[ 0. , 0. , -0.56598029],
[ 0. , -0.21166188, -0.08773241]])
Yet another idea would be to simply use the barebones numpy.clip() API, which returns a new array if we omit the out= kwarg.
In [22]: np.clip(P, -np.inf, 0) # P.clip(-np.inf, 0)
Out[22]:
array([[ 0. , 0. , -0.19012371],
[ 0. , 0. , -0.20315014],
[ 0. , 0. , -0.56598029],
[ 0. , -0.21166188, -0.08773241]])
This should work: get the indices of all elements that are greater than or equal to 0 and set them to 0; this preserves the dimensions. I got the idea from here: Replace all elements of Python NumPy Array that are greater than some value
Note that this modifies the original array in place; no temp array is used here.
import numpy as np
P = np.array([[ 0.49530662, 0.07901 , -0.19012371],
[ 0.1421513 , 0.48607405, -0.20315014],
[ 0.76467375, 0.16479826, -0.56598029],
[ 0.53530718, -0.21166188, -0.08773241]])
P[P >= 0] = 0
print(P)
The output will be
[[ 0. 0. -0.19012371]
[ 0. 0. -0.20315014]
[ 0. 0. -0.56598029]
[ 0. -0.21166188 -0.08773241]]
As noted in the comments, this modifies the array in place. To preserve the original array, use np.where(P<0, P, 0) as follows (thanks @kmario123):
import numpy as np
P = np.array([[ 0.49530662, 0.07901 , -0.19012371],
[ 0.1421513 , 0.48607405, -0.20315014],
[ 0.76467375, 0.16479826, -0.56598029],
[ 0.53530718, -0.21166188, -0.08773241]])
print( np.where(P<0, P, 0))
print(P)
The output will be
[[ 0. 0. -0.19012371]
[ 0. 0. -0.20315014]
[ 0. 0. -0.56598029]
[ 0. -0.21166188 -0.08773241]]
[[ 0.49530662 0.07901 -0.19012371]
[ 0.1421513 0.48607405 -0.20315014]
[ 0.76467375 0.16479826 -0.56598029]
[ 0.53530718 -0.21166188 -0.08773241]]
Python version: 2.7
I have the following numpy 2d array:
array([[ -5.05000000e+01, -1.05000000e+01],
[ -4.04000000e+01, -8.40000000e+00],
[ -3.03000000e+01, -6.30000000e+00],
[ -2.02000000e+01, -4.20000000e+00],
[ -1.01000000e+01, -2.10000000e+00],
[ 7.10542736e-15, -1.77635684e-15],
[ 1.01000000e+01, 2.10000000e+00],
[ 2.02000000e+01, 4.20000000e+00],
[ 3.03000000e+01, 6.30000000e+00],
[ 4.04000000e+01, 8.40000000e+00]])
If I wanted to find all the combinations of the first and the second columns, I would use np.array(np.meshgrid(first_column, second_column)).T.reshape(-1,2). As a result, I would get a 100x2 matrix with 10*10 = 100 combinations. However, my matrix can have 3, 4, or more columns, so this call doesn't generalize.
Question: how can I make an automatically meshgridded matrix with 3+ columns?
UPD: for example, I have the initial array:
[[-50.5 -10.5]
[ 0. 0. ]]
As a result, I want to have the output array like this:
array([[-10.5, -50.5],
[-10.5, 0. ],
[ 0. , -50.5],
[ 0. , 0. ]])
or this:
array([[-50.5, -10.5],
[-50.5, 0. ],
[ 0. , -10.5],
[ 0. , 0. ]])
You could use the * operator on the transposed array, which unpacks its columns as separate arguments to np.meshgrid. Finally, a swapaxes operation is needed to merge the output grid arrays into one array.
Thus, one generic solution would be -
np.swapaxes(np.meshgrid(*arr.T),0,2)
Sample run -
In [44]: arr
Out[44]:
array([[-50.5, -10.5],
[ 0. , 0. ]])
In [45]: np.swapaxes(np.meshgrid(*arr.T),0,2)
Out[45]:
array([[[-50.5, -10.5],
[-50.5, 0. ]],
[[ 0. , -10.5],
[ 0. , 0. ]]])
I have been trying to divide a Python scipy sparse matrix by the vector of its row sums. Here is my code:
sparse_mat = bsr_matrix((l_data, (l_row, l_col)), dtype=float)
sparse_mat = sparse_mat / (sparse_mat.sum(axis = 1)[:,None])
However, it throws an error no matter how I try it
sparse_mat = sparse_mat / (sparse_mat.sum(axis = 1)[:,None])
File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 381, in __div__
return self.__truediv__(other)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 427, in __truediv__
raise NotImplementedError
NotImplementedError
Anyone with an idea of where I am going wrong?
You can circumvent the problem by creating a sparse diagonal matrix from the reciprocals of your row sums and then multiplying it with your matrix. In the product the diagonal matrix goes left and your matrix goes right.
Example:
>>> a
array([[0, 9, 0, 0, 1, 0],
[2, 0, 5, 0, 0, 9],
[0, 2, 0, 0, 0, 0],
[2, 0, 0, 0, 0, 0],
[0, 9, 5, 3, 0, 7],
[1, 0, 0, 8, 9, 0]])
>>> b = sparse.bsr_matrix(a)
>>>
>>> c = sparse.diags(1/b.sum(axis=1).A.ravel())
>>> # on older scipy versions the offsets parameter (default 0)
... # is a required argument, thus
... # c = sparse.diags(1/b.sum(axis=1).A.ravel(), 0)
...
>>> a/a.sum(axis=1, keepdims=True)
array([[ 0. , 0.9 , 0. , 0. , 0.1 , 0. ],
[ 0.125 , 0. , 0.3125 , 0. , 0. , 0.5625 ],
[ 0. , 1. , 0. , 0. , 0. , 0. ],
[ 1. , 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0.375 , 0.20833333, 0.125 , 0. , 0.29166667],
[ 0.05555556, 0. , 0. , 0.44444444, 0.5 , 0. ]])
>>> (c @ b).todense()  # on Python < 3.5 replace c @ b with c.dot(b)
matrix([[ 0. , 0.9 , 0. , 0. , 0.1 , 0. ],
[ 0.125 , 0. , 0.3125 , 0. , 0. , 0.5625 ],
[ 0. , 1. , 0. , 0. , 0. , 0. ],
[ 1. , 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0.375 , 0.20833333, 0.125 , 0. , 0.29166667],
[ 0.05555556, 0. , 0. , 0.44444444, 0.5 , 0. ]])
Something funny is going on. I have no problem performing the elementwise division. I wonder if it's a Py2 issue; I'm using Py3.
In [1022]: A=sparse.bsr_matrix([[2,4],[1,2]])
In [1023]: A
Out[1023]:
<2x2 sparse matrix of type '<class 'numpy.int32'>'
with 4 stored elements (blocksize = 2x2) in Block Sparse Row format>
In [1024]: A.A
Out[1024]:
array([[2, 4],
[1, 2]], dtype=int32)
In [1025]: A.sum(axis=1)
Out[1025]:
matrix([[6],
[3]], dtype=int32)
In [1026]: A/A.sum(axis=1)
Out[1026]:
matrix([[ 0.33333333, 0.66666667],
[ 0.33333333, 0.66666667]])
or to try the other example:
In [1027]: b=sparse.bsr_matrix([[0, 9, 0, 0, 1, 0],
...: [2, 0, 5, 0, 0, 9],
...: [0, 2, 0, 0, 0, 0],
...: [2, 0, 0, 0, 0, 0],
...: [0, 9, 5, 3, 0, 7],
...: [1, 0, 0, 8, 9, 0]])
In [1028]: b
Out[1028]:
<6x6 sparse matrix of type '<class 'numpy.int32'>'
with 14 stored elements (blocksize = 1x1) in Block Sparse Row format>
In [1029]: b.sum(axis=1)
Out[1029]:
matrix([[10],
[16],
[ 2],
[ 2],
[24],
[18]], dtype=int32)
In [1030]: b/b.sum(axis=1)
Out[1030]:
matrix([[ 0. , 0.9 , 0. , 0. , 0.1 , 0. ],
[ 0.125 , 0. , 0.3125 , 0. , 0. , 0.5625 ],
....
[ 0.05555556, 0. , 0. , 0.44444444, 0.5 , 0. ]])
The result of this sparse/dense division is also dense, whereas c*b (where c is the sparse diagonal matrix) is sparse:
In [1039]: c*b
Out[1039]:
<6x6 sparse matrix of type '<class 'numpy.float64'>'
with 14 stored elements in Compressed Sparse Row format>
The sparse sum is a dense matrix. It is 2d, so there's no need to expand its dimensions. In fact, if I try that I get an error:
In [1031]: A/(A.sum(axis=1)[:,None])
....
ValueError: shape too large to be a matrix.
Per this message, to keep the matrix sparse, you access the data values and use the (nonzero) indices:
sums = np.asarray(A.sum(axis=1)).squeeze() # this is dense
A.data /= sums[A.nonzero()[0]]
To divide by the nonzero row mean instead of the sum, one can use:
nnz = A.getnnz(axis=1) # this is also dense
means = sums / nnz
A.data /= means[A.nonzero()[0]]
When using scipy.sparse.spdiags or scipy.sparse.diags I have noticed what I consider to be a bug in the routines, e.g.
scipy.sparse.spdiags([1.1,1.2,1.3],1,4,4).toarray()
returns
array([[ 0. , 1.2, 0. , 0. ],
[ 0. , 0. , 1.3, 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ]])
That is, for positive diagonals it drops the first k data values. One might argue that there is some grand programming reason for this and that I just need to pad with zeros. Annoying as that may be, one can use scipy.sparse.diags, which gives the correct result. However, this routine has a bug that can't be worked around:
scipy.sparse.diags([1.1,1.2],0,(4,2)).toarray()
gives
array([[ 1.1, 0. ],
[ 0. , 1.2],
[ 0. , 0. ],
[ 0. , 0. ]])
nice, and
scipy.sparse.diags([1.1,1.2],-2,(4,2)).toarray()
gives
array([[ 0. , 0. ],
[ 0. , 0. ],
[ 1.1, 0. ],
[ 0. , 1.2]])
but
scipy.sparse.diags([1.1,1.2],-1,(4,2)).toarray()
gives an error saying ValueError: Diagonal length (index 0: 2 at offset -1) does not agree with matrix size (4, 2). Obviously the answer is
array([[ 0. , 0. ],
[ 1.1, 0. ],
[ 0. , 1.2],
[ 0. , 0. ]])
and for extra random behaviour we have
scipy.sparse.diags([1.1],-1,(4,2)).toarray()
giving
array([[ 0. , 0. ],
[ 1.1, 0. ],
[ 0. , 1.1],
[ 0. , 0. ]])
Anyone know if there is a function for constructing diagonal sparse matrices that actually works?
Executive summary: spdiags works correctly, even if the matrix input isn't the most intuitive. diags has a bug that affects some offsets in rectangular matrices. There is a bug fix on scipy github.
The example for spdiags is:
>>> data = array([[1,2,3,4],[1,2,3,4],[1,2,3,4]])
>>> diags = array([0,-1,2])
>>> spdiags(data, diags, 4, 4).todense()
matrix([[1, 0, 3, 0],
[1, 2, 0, 4],
[0, 2, 3, 0],
[0, 0, 3, 4]])
Note that the 3rd column of data always appears in the 3rd column of the sparse matrix. The other columns also line up, but values are omitted where they 'fall off the edge'.
The input to this function is a matrix, while the input to diags is a ragged list. The diagonals of the sparse matrix all have different numbers of values, so the specification has to accommodate this one way or another. spdiags does this by ignoring some values, diags by taking a list input, as sketched below.
The sparse.diags([1.1,1.2],-1,(4,2)) error is puzzling.
The spdiags equivalent does work:
In [421]: sparse.spdiags([[1.1,1.2]],-1,4,2).A
Out[421]:
array([[ 0. , 0. ],
[ 1.1, 0. ],
[ 0. , 1.2],
[ 0. , 0. ]])
The error is raised in this block of code:
for j, diagonal in enumerate(diagonals):
offset = offsets[j]
k = max(0, offset)
length = min(m + offset, n - offset)
if length <= 0:
raise ValueError("Offset %d (index %d) out of bounds" % (offset, j))
try:
data_arr[j, k:k+length] = diagonal
except ValueError:
if len(diagonal) != length and len(diagonal) != 1:
raise ValueError(
"Diagonal length (index %d: %d at offset %d) does not "
"agree with matrix size (%d, %d)." % (
j, len(diagonal), offset, m, n))
raise
The actual matrix constructor in diags is:
dia_matrix((data_arr, offsets), shape=(m, n))
This is the same constructor that spdiags uses, but without any manipulation.
In [434]: sparse.dia_matrix(([[1.1,1.2]],-1),shape=(4,2)).A
Out[434]:
array([[ 0. , 0. ],
[ 1.1, 0. ],
[ 0. , 1.2],
[ 0. , 0. ]])
In dia format, the inputs are stored exactly as given by spdiags (complete with the extra values in that data matrix):
In [436]: M.data
Out[436]: array([[ 1.1, 1.2]])
In [437]: M.offsets
Out[437]: array([-1], dtype=int32)
As @user2357112 points out, length = min(m + offset, n - offset) is wrong, producing 3 in the test case. Changing it to length = min(m + k, n - k) makes all cases for this (4,2) matrix work, but it fails with the transpose: diags([1.1,1.2], 1, (2, 4)).
The correction, as of Oct 5, for this issue is:
https://github.com/pv/scipy-work/commit/529cbde47121c8ed87f74fa6445c05d71353eb6c
length = min(m + offset, n - offset, min(m,n))
With this fix, diags([1.1,1.2], 1, (2, 4)) works.