I have been trying to divide a Python scipy sparse matrix by the vector of its row sums. Here is my code:
sparse_mat = bsr_matrix((l_data, (l_row, l_col)), dtype=float)
sparse_mat = sparse_mat / (sparse_mat.sum(axis = 1)[:,None])
However, it throws an error no matter how I try it:
sparse_mat = sparse_mat / (sparse_mat.sum(axis = 1)[:,None])
File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 381, in __div__
return self.__truediv__(other)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 427, in __truediv__
raise NotImplementedError
NotImplementedError
Does anyone have an idea of where I am going wrong?
You can circumvent the problem by creating a sparse diagonal matrix from the reciprocals of your row sums and then multiplying your matrix by it. In the product, the diagonal matrix goes on the left and your matrix goes on the right.
Example:
>>> a
array([[0, 9, 0, 0, 1, 0],
[2, 0, 5, 0, 0, 9],
[0, 2, 0, 0, 0, 0],
[2, 0, 0, 0, 0, 0],
[0, 9, 5, 3, 0, 7],
[1, 0, 0, 8, 9, 0]])
>>> b = sparse.bsr_matrix(a)
>>>
>>> c = sparse.diags(1.0 / b.sum(axis=1).A.ravel())  # 1.0 guards against integer division on Py2
>>> # on older scipy versions the offsets parameter (default 0)
... # is a required argument, thus
... # c = sparse.diags(1.0 / b.sum(axis=1).A.ravel(), 0)
...
>>> a/a.sum(axis=1, keepdims=True)
array([[ 0. , 0.9 , 0. , 0. , 0.1 , 0. ],
[ 0.125 , 0. , 0.3125 , 0. , 0. , 0.5625 ],
[ 0. , 1. , 0. , 0. , 0. , 0. ],
[ 1. , 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0.375 , 0.20833333, 0.125 , 0. , 0.29166667],
[ 0.05555556, 0. , 0. , 0.44444444, 0.5 , 0. ]])
>>> (c @ b).todense() # on Python < 3.5 replace c @ b with c.dot(b)
matrix([[ 0. , 0.9 , 0. , 0. , 0.1 , 0. ],
[ 0.125 , 0. , 0.3125 , 0. , 0. , 0.5625 ],
[ 0. , 1. , 0. , 0. , 0. , 0. ],
[ 1. , 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0.375 , 0.20833333, 0.125 , 0. , 0.29166667],
[ 0.05555556, 0. , 0. , 0.44444444, 0.5 , 0. ]])
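An alternative that also keeps things sparse is the multiply method, which on reasonably recent scipy versions broadcasts a dense (n, 1) operand across the rows. A sketch reusing the array a from above (multiply returns a COO matrix here, so convert back if you need CSR):
from scipy import sparse
b = sparse.csr_matrix(a)                 # a is the dense array shown above
row_sums = b.sum(axis=1)                 # dense (6, 1) np.matrix of row sums
normalized = b.multiply(1.0 / row_sums)  # broadcasts the column over the rows
print(normalized.tocsr().toarray())      # same values as the dense reference result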
Something funny is going on. I have no problem performing the elementwise division; I wonder if it's a Py2 issue. I'm using Py3.
In [1022]: A=sparse.bsr_matrix([[2,4],[1,2]])
In [1023]: A
Out[1023]:
<2x2 sparse matrix of type '<class 'numpy.int32'>'
with 4 stored elements (blocksize = 2x2) in Block Sparse Row format>
In [1024]: A.A
Out[1024]:
array([[2, 4],
[1, 2]], dtype=int32)
In [1025]: A.sum(axis=1)
Out[1025]:
matrix([[6],
[3]], dtype=int32)
In [1026]: A/A.sum(axis=1)
Out[1026]:
matrix([[ 0.33333333, 0.66666667],
[ 0.33333333, 0.66666667]])
or to try the other example:
In [1027]: b=sparse.bsr_matrix([[0, 9, 0, 0, 1, 0],
...: [2, 0, 5, 0, 0, 9],
...: [0, 2, 0, 0, 0, 0],
...: [2, 0, 0, 0, 0, 0],
...: [0, 9, 5, 3, 0, 7],
...: [1, 0, 0, 8, 9, 0]])
In [1028]: b
Out[1028]:
<6x6 sparse matrix of type '<class 'numpy.int32'>'
with 14 stored elements (blocksize = 1x1) in Block Sparse Row format>
In [1029]: b.sum(axis=1)
Out[1029]:
matrix([[10],
[16],
[ 2],
[ 2],
[24],
[18]], dtype=int32)
In [1030]: b/b.sum(axis=1)
Out[1030]:
matrix([[ 0. , 0.9 , 0. , 0. , 0.1 , 0. ],
[ 0.125 , 0. , 0.3125 , 0. , 0. , 0.5625 ],
....
[ 0.05555556, 0. , 0. , 0.44444444, 0.5 , 0. ]])
The result of this sparse/dense division is also dense, whereas c*b (where c is the sparse diagonal) is sparse.
In [1039]: c*b
Out[1039]:
<6x6 sparse matrix of type '<class 'numpy.float64'>'
with 14 stored elements in Compressed Sparse Row format>
The sparse sum is a dense matrix. It is 2d, so there's no need to expand its dimensions. In fact, if I try that I get an error:
In [1031]: A/(A.sum(axis=1)[:,None])
....
ValueError: shape too large to be a matrix.
Per this message, to keep the matrix sparse, you operate on the stored data values directly, indexing the dense row sums by the (nonzero) row indices (this assumes a canonical CSR/COO matrix, where the order of .data matches the order of .nonzero()):
sums = np.asarray(A.sum(axis=1)).squeeze()  # dense 1d array of row sums
A.data /= sums[A.nonzero()[0]]              # scale each stored value by its row's sum
To divide by the nonzero row mean instead of the sum, one can do:
nnz = A.getnnz(axis=1)                      # stored values per row (also dense)
means = sums / nnz
A.data /= means[A.nonzero()[0]]
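For reference, here is a minimal self-contained version of the in-place scaling on a small CSR matrix (the values are the first three rows of the earlier example; the float dtype matters, since integer data can't be divided in place):
import numpy as np
from scipy import sparse

A = sparse.csr_matrix([[0, 9, 0, 0, 1, 0],
                       [2, 0, 5, 0, 0, 9],
                       [0, 2, 0, 0, 0, 0]], dtype=float)
sums = np.asarray(A.sum(axis=1)).squeeze()  # dense row sums: [10. 16. 2.]
A.data /= sums[A.nonzero()[0]]              # divide each stored value by its row sum
print(A.toarray())
# [[0.     0.9    0.     0.     0.1    0.    ]
#  [0.125  0.     0.3125 0.     0.     0.5625]
#  [0.     1.     0.     0.     0.     0.    ]]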
Given a 5x4 matrix A, constructed with the following Python code:
A = np.array([[1, 0, 0, 0],
[0, 0, 0, 4],
[0, 3, 0, 0],
[0, 0, 0, 0],
[2, 0, 0, 0]])
Wolfram Alpha gives the SVD result with the singular values Σ displayed as a rectangular diagonal matrix (the images are omitted here), while the equivalent quantity in the output of np.linalg.svd (NumPy calls it s) is a plain 1d vector:
[ 4.          3.          2.23606798 -0.        ]
Is there a way to have the output of numpy.linalg.svd shown the way Wolfram Alpha shows it?
You can get most of the way there with diag:
>>> u, s, vh = np.linalg.svd(A)
>>> np.diag(s)
array([[ 4. , 0. , 0. , 0. ],
[ 0. , 3. , 0. , 0. ],
[ 0. , 0. , 2.23606798, 0. ],
[ 0. , 0. , 0. , -0. ]])
Note that Wolfram Alpha gives an extra row. Getting that is marginally more involved:
>>> sigma = np.zeros(A.shape, s.dtype)
>>> np.fill_diagonal(sigma, s)
>>> sigma
array([[ 4. , 0. , 0. , 0. ],
[ 0. , 3. , 0. , 0. ],
[ 0. , 0. , 2.23606798, 0. ],
[ 0. , 0. , 0. , -0. ],
[ 0. , 0. , 0. , 0. ]])
Depending on what your goal is, removing a column from u might be a better approach than adding a row of zeros to sigma. That would look like:
>>> u, s, vh = np.linalg.svd(A, full_matrices=False)
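Either way, a quick sanity check is to rebuild A from the factors. With the reduced factorization, no padding of sigma is needed at all:
import numpy as np
A = np.array([[1, 0, 0, 0],
              [0, 0, 0, 4],
              [0, 3, 0, 0],
              [0, 0, 0, 0],
              [2, 0, 0, 0]])
u, s, vh = np.linalg.svd(A, full_matrices=False)
# u is (5, 4), s is (4,), vh is (4, 4)
print(np.allclose(A, u @ np.diag(s) @ vh))  # True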
Let's say I have d1, d2 and d3 as follows. t is a variable combining my arrays, and m contains the indices of the smallest values across them.
>>> d1
array([[ 0.9850916 , 0.95004463, 1.35728604, 1.18554035],
[ 0.47624542, 0.45561795, 0.6231743 , 0.94746001],
[ 0.74008166, 0. , 1.59774065, 1.00423774],
[ 0.86173439, 0.70940862, 1.0601817 , 0.96112015],
[ 1.03413477, 0.64874991, 1.27488263, 0.80250053]])
>>> d2
array([[ 0.27301946, 0.38387185, 0.93215524, 0.98851404],
[ 0.17996978, 0. , 0.41283798, 0.15204035],
[ 0.10952115, 0.45561795, 0.5334015 , 0.75242805],
[ 0.4600214 , 0.74100962, 0.16743427, 0.36250385],
[ 0.60984208, 0.35161234, 0.44580535, 0.6713633 ]])
>>> d3
array([[ 0. , 0.19658541, 1.14605925, 1.18431945],
[ 0.10697428, 0.27301946, 0.45536417, 0.11922118],
[ 0.42153386, 0.9850916 , 0.28225364, 0.82765657],
[ 1.04940684, 1.63082272, 0.49987388, 0.38596938],
[ 0.21015723, 1.07007177, 0.22599987, 0.89288339]])
>>> t = np.array([d1, d2, d3])
>>> t
array([[[ 0.9850916 , 0.95004463, 1.35728604, 1.18554035],
[ 0.47624542, 0.45561795, 0.6231743 , 0.94746001],
[ 0.74008166, 0. , 1.59774065, 1.00423774],
[ 0.86173439, 0.70940862, 1.0601817 , 0.96112015],
[ 1.03413477, 0.64874991, 1.27488263, 0.80250053]],
[[ 0.27301946, 0.38387185, 0.93215524, 0.98851404],
[ 0.17996978, 0. , 0.41283798, 0.15204035],
[ 0.10952115, 0.45561795, 0.5334015 , 0.75242805],
[ 0.4600214 , 0.74100962, 0.16743427, 0.36250385],
[ 0.60984208, 0.35161234, 0.44580535, 0.6713633 ]],
[[ 0. , 0.19658541, 1.14605925, 1.18431945],
[ 0.10697428, 0.27301946, 0.45536417, 0.11922118],
[ 0.42153386, 0.9850916 , 0.28225364, 0.82765657],
[ 1.04940684, 1.63082272, 0.49987388, 0.38596938],
[ 0.21015723, 1.07007177, 0.22599987, 0.89288339]]])
>>> m = np.argmin(t, axis=0)
>>> m
array([[2, 2, 1, 1],
[2, 1, 1, 2],
[1, 0, 2, 1],
[1, 0, 1, 1],
[2, 1, 2, 1]])
From m and t, I want to pick out the actual minimum values, as shown below. How do I do this, preferably in an efficient way?
array([ [ 0. , 0.19658541, 0.93215524, 0.98851404],
[ 0.10697428, 0. , 0.41283798, 0.11922118],
[ 0.10952115, 0. , 0.28225364, 0.75242805],
[ 0.4600214 , 0.70940862, 0.16743427, 0.36250385],
[ 0.21015723, 0.35161234, 0.22599987, 0.6713633 ]])
If the minimum itself is all you need, you can use np.min(t, axis=0).
If you want to pick values via the index array m, you can use choose:
m.choose(t) # This will return the same thing.
It can also be written as
np.choose(m, t)
Which returns:
array([[0. , 0.19658541, 0.93215524, 0.98851404],
[0.10697428, 0. , 0.41283798, 0.11922118],
[0.10952115, 0. , 0.28225364, 0.75242805],
[0.4600214 , 0.70940862, 0.16743427, 0.36250385],
[0.21015723, 0.35161234, 0.22599987, 0.6713633 ]])
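On NumPy 1.15 or later, np.take_along_axis is another option, and it sidesteps choose's limit of 32 choice arrays. A sketch using the t and m from above:
import numpy as np
# m has shape (5, 4); a leading axis makes it select along axis 0 of t
result = np.take_along_axis(t, m[None, ...], axis=0)[0]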
I have the following numpy array:
import numpy as np
np.random.seed(20)
arr = np.random.rand(20).reshape(5, 4)
array([[ 0.5881308 , 0.89771373, 0.89153073, 0.81583748],
[ 0.03588959, 0.69175758, 0.37868094, 0.51851095],
[ 0.65795147, 0.19385022, 0.2723164 , 0.71860593],
[ 0.78300361, 0.85032764, 0.77524489, 0.03666431],
[ 0.11669374, 0.7512807 , 0.23921822, 0.25480601]])
For each column I would like to slice it starting at these positions, shifting the values up and padding the tail with zeros:
position_for_slicing = [0, 3, 4, 4]
So I will get the following array:
array([[ 0.5881308 ,  0.85032764,  0.23921822,  0.25480601],
       [ 0.03588959,  0.7512807 ,  0.        ,  0.        ],
       [ 0.65795147,  0.        ,  0.        ,  0.        ],
       [ 0.78300361,  0.        ,  0.        ,  0.        ],
       [ 0.11669374,  0.        ,  0.        ,  0.        ]])
Is there a fast way to do this? I know I can use a for loop for each column, but I was wondering if there is a more elegant way.
If "elegant" means "no loop" the following would qualify, but probably not under many other definitions (arr is your input array):
m, n = arr.shape
arrf = np.asanyarray(arr, order='F')
padded = np.r_[arrf, np.zeros_like(arrf)]
assert padded.flags['F_CONTIGUOUS']
expnd = np.lib.stride_tricks.as_strided(padded, (m, m+1, n), padded.strides[:1] + padded.strides)
expnd[:, [0,3,4,4], range(4)]
# array([[ 0.5881308 , 0.85032764, 0.23921822, 0.25480601],
# [ 0.03588959, 0.7512807 , 0. , 0. ],
# [ 0.65795147, 0. , 0. , 0. ],
# [ 0.78300361, 0. , 0. , 0. ],
# [ 0.11669374, 0. , 0. , 0. ]])
Please note that order='C' and then 'C_CONTIGUOUS' in the assertion also works. My hunch is that 'F' could be a bit faster because the indexing then operates on contiguous slices.
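Under a looser definition of "elegant", plain fancy indexing with a mask also works and may be easier to follow. A sketch (pos is the position_for_slicing list from the question; out-of-range cells become zero):
import numpy as np
m, n = arr.shape
pos = np.array([0, 3, 4, 4])
rows = pos[None, :] + np.arange(m)[:, None]  # source row for each output cell
safe = np.minimum(rows, m - 1)               # clamp so the indexing stays in bounds
out = np.where(rows < m, arr[safe, np.arange(n)], 0.0)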
[Short version]
Is there an equivalent to numpy.diagflat() in scipy.sparse? Or, failing that, some way to 'flatten' a sparse matrix without going through a dense array?
[Long version]
I have a sparse matrix (mathematically a vector), x_f, that I need to diagonalise (i.e. create a square matrix with the values of the x_f vector on the diagonal).
x_f
Out[59]:
<35021x1 sparse matrix of type '<class 'numpy.float64'>'
with 47 stored elements in Compressed Sparse Row format>
I've tried diags from the scipy.sparse module. (I've also tried spdiags, but it's just a fancier version of diags, which I don't need.) I've tried it with every combination of [csr or csc format], [original or transposed vector] and [.todense() or .toarray()], but I keep getting the error:
ValueError: Different number of diagonals and offsets.
With sparse.diags the default offset is 0, i.e. the main diagonal, which is exactly where I want the numbers to go, so getting this error means it's not working as I expect.
Here are examples of the original and transposed vector with .todense() and .toarray() respectively:
x_f_original.todense()
Out[72]:
matrix([[ 0.00000000e+00],
[ 0.00000000e+00],
[ 0.00000000e+00],
...,
[ 0.00000000e+00],
[ 1.03332178e-17],
[ 0.00000000e+00]])
x_f_transposed.toarray()
Out[83]:
array([[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
0.00000000e+00, 1.03332178e-17, 0.00000000e+00]])
The following code works, but takes about 15 seconds to run:
x_f_diag = sparse.csc_matrix(np.diagflat(x_f.todense()))
Does anyone have any ideas of how to make it more efficient or just a better way to do this?
[Disclaimer]
This is my first question here. I hope I did it right and apologise for anything that's unclear.
In [106]: x_f = sparse.random(1000,1, .1, 'csr')
In [107]: x_f
Out[107]:
<1000x1 sparse matrix of type '<class 'numpy.float64'>'
with 100 stored elements in Compressed Sparse Row format>
I can use it in sparse.diags if I turn it into a 1d dense array.
In [108]: M1=sparse.diags(x_f.A.ravel()).tocsr()
In [109]: M1
Out[109]:
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
with 100 stored elements in Compressed Sparse Row format>
Or I can make it a (1,1000) matrix, and use a list as the offset:
In [110]: M2=sparse.diags(x_f.T.A,[0]).tocsr()
In [111]: M2
Out[111]:
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
with 100 stored elements in Compressed Sparse Row format>
diags takes a dense diagonal, not a sparse one. This is stored as is, so I have to use the further .tocsr to remove 0s etc.
In [113]: sparse.diags(x_f.T.A,[0])
Out[113]:
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
with 1000 stored elements (1 diagonals) in DIAgonal format>
So either way I am matching the shape of the diagonal with the number of offsets (scalar or 1).
A direct mapping to csr (or csc) is probably faster.
With this column shape, the indices attribute doesn't tell us anything.
In [125]: x_f.indices
Out[125]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...0, 0, 0], dtype=int32)
But transform it to csc (the conversion maps the indptr information onto the indices):
In [126]: x_f.tocsc().indices
Out[126]:
array([ 2, 15, 26, 32, 47, 56, 75, 82, 96, 99, 126, 133, 136,
141, 145, 149, ... 960, 976], dtype=int32)
In [127]: idx=x_f.tocsc().indices
In [128]: M3 = sparse.csr_matrix((x_f.data, (idx, idx)),(1000,1000))
In [129]: M3
Out[129]:
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
with 100 stored elements in Compressed Sparse Row format>
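As a quick consistency check (continuing the session above), the two constructions should agree exactly, so the elementwise inequality should store nothing:
assert (M1 != M3).nnz == 0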
You can use the following constructor:
csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])
where data, row_ind and col_ind satisfy the relationship
a[row_ind[k], col_ind[k]] = data[k].
Demo (COO Matrix):
import numpy as np
from scipy.sparse import random, csr_matrix, coo_matrix
In [142]: M = random(10000, 1, .005, 'coo')
In [143]: M
Out[143]:
<10000x1 sparse matrix of type '<class 'numpy.float64'>'
with 50 stored elements in COOrdinate format>
In [144]: M2 = coo_matrix((M.data, np.diag_indices(len(M.data))), (len(M.data), len(M.data)))
In [145]: M2
Out[145]:
<50x50 sparse matrix of type '<class 'numpy.float64'>'
with 50 stored elements in COOrdinate format>
In [146]: M2.todense()
Out[146]:
matrix([[ 0.1559936 , 0. , 0. , ..., 0. , 0. , 0. ],
[ 0. , 0.28984266, 0. , ..., 0. , 0. , 0. ],
[ 0. , 0. , 0.21381431, ..., 0. , 0. , 0. ],
...,
[ 0. , 0. , 0. , ..., 0.23100531, 0. , 0. ],
[ 0. , 0. , 0. , ..., 0. , 0.13789309, 0. ],
[ 0. , 0. , 0. , ..., 0. , 0. , 0.73827 ]])
Demo (CSR matrix):
In [112]: from scipy.sparse import random, csr_matrix
In [113]: M = random(10000, 1, .005, 'csr')
In [114]: M
Out[114]:
<10000x1 sparse matrix of type '<class 'numpy.float64'>'
with 50 stored elements in Compressed Sparse Row format>
In [137]: M2 = csr_matrix((M.data, np.diag_indices(len(M.data))), (len(M.data), len(M.data)))
In [138]: M2
Out[138]:
<50x50 sparse matrix of type '<class 'numpy.float64'>'
with 50 stored elements in Compressed Sparse Row format>
In [139]: M2.todense()
Out[139]:
matrix([[ 0.45661992, 0. , 0. , ..., 0. , 0. , 0. ],
[ 0. , 0.42428401, 0. , ..., 0. , 0. , 0. ],
[ 0. , 0. , 0.99484544, ..., 0. , 0. , 0. ],
...,
[ 0. , 0. , 0. , ..., 0.80880579, 0. , 0. ],
[ 0. , 0. , 0. , ..., 0. , 0.46292628, 0. ],
[ 0. , 0. , 0. , ..., 0. , 0. , 0.56363196]])
If you need a dense matrix:
In [147]: np.diagflat(M.data)
Out[147]:
array([[ 0.1559936 , 0. , 0. , ..., 0. , 0. , 0. ],
[ 0. , 0.28984266, 0. , ..., 0. , 0. , 0. ],
[ 0. , 0. , 0.21381431, ..., 0. , 0. , 0. ],
...,
[ 0. , 0. , 0. , ..., 0.23100531, 0. , 0. ],
[ 0. , 0. , 0. , ..., 0. , 0.13789309, 0. ],
[ 0. , 0. , 0. , ..., 0. , 0. , 0.73827 ]])
When using scipy.sparse.spdiags or scipy.sparse.diags, I have noticed what I consider to be a bug in the routines, e.g.
scipy.sparse.spdiags([1.1,1.2,1.3],1,4,4).toarray()
returns
array([[ 0. , 1.2, 0. , 0. ],
[ 0. , 0. , 1.3, 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ]])
That is, for positive diagonals it drops the first k data values. One might argue that there is some grand programming reason for this and that I just need to pad with zeros. OK, annoying as that may be, one can use scipy.sparse.diags, which gives the correct result. However, this routine has a bug that can't be worked around:
scipy.sparse.diags([1.1,1.2],0,(4,2)).toarray()
gives
array([[ 1.1, 0. ],
[ 0. , 1.2],
[ 0. , 0. ],
[ 0. , 0. ]])
nice, and
scipy.sparse.diags([1.1,1.2],-2,(4,2)).toarray()
gives
array([[ 0. , 0. ],
[ 0. , 0. ],
[ 1.1, 0. ],
[ 0. , 1.2]])
but
scipy.sparse.diags([1.1,1.2],-1,(4,2)).toarray()
gives an error saying ValueError: Diagonal length (index 0: 2 at offset -1) does not agree with matrix size (4, 2). Obviously the answer is
array([[ 0. , 0. ],
[ 1.1, 0. ],
[ 0. , 1.2],
[ 0. , 0. ]])
and for extra random behaviour we have
scipy.sparse.diags([1.1],-1,(4,2)).toarray()
giving
array([[ 0. , 0. ],
[ 1.1, 0. ],
[ 0. , 1.1],
[ 0. , 0. ]])
Anyone know if there is a function for constructing diagonal sparse matrices that actually works?
Executive summary: spdiags works correctly, even if the matrix input isn't the most intuitive. diags has a bug that affects some offsets in rectangular matrices. There is a bug fix on scipy github.
The example for spdiags is:
>>> data = array([[1,2,3,4],[1,2,3,4],[1,2,3,4]])
>>> diags = array([0,-1,2])
>>> spdiags(data, diags, 4, 4).todense()
matrix([[1, 0, 3, 0],
[1, 2, 0, 4],
[0, 2, 3, 0],
[0, 0, 3, 4]])
Note that the 3rd column of data always appears in the 3rd column of the sparse matrix. The other columns also line up, but they are omitted where they 'fall off the edge'.
The input to this function is a matrix, while the input to diags is a ragged list. The diagonals of the sparse matrix all have different numbers of values, so the specification has to accommodate this in one way or another. spdiags does this by ignoring some values, diags by taking a list input.
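For contrast, here is the diags ragged-list convention mirroring the spdiags example above (this is essentially the example from the diags docstring):
>>> # each diagonal carries exactly its own length; offsets pair one-to-one
>>> sparse.diags([[1, 2, 3, 4], [1, 2, 3], [1, 2]], [0, -1, 2]).toarray()
array([[ 1.,  0.,  1.,  0.],
       [ 1.,  2.,  0.,  2.],
       [ 0.,  2.,  3.,  0.],
       [ 0.,  0.,  3.,  4.]])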
The sparse.diags([1.1,1.2],-1,(4,2)) error is puzzling.
The spdiags equivalent does work:
In [421]: sparse.spdiags([[1.1,1.2]],-1,4,2).A
Out[421]:
array([[ 0. , 0. ],
[ 1.1, 0. ],
[ 0. , 1.2],
[ 0. , 0. ]])
The error is raised in this block of code:
for j, diagonal in enumerate(diagonals):
offset = offsets[j]
k = max(0, offset)
length = min(m + offset, n - offset)
if length <= 0:
raise ValueError("Offset %d (index %d) out of bounds" % (offset, j))
try:
data_arr[j, k:k+length] = diagonal
except ValueError:
if len(diagonal) != length and len(diagonal) != 1:
raise ValueError(
"Diagonal length (index %d: %d at offset %d) does not "
"agree with matrix size (%d, %d)." % (
j, len(diagonal), offset, m, n))
raise
The actual matrix constructor in diags is:
dia_matrix((data_arr, offsets), shape=(m, n))
This is the same constructor that spdiags uses, but without any manipulation.
In [434]: sparse.dia_matrix(([[1.1,1.2]],-1),shape=(4,2)).A
Out[434]:
array([[ 0. , 0. ],
[ 1.1, 0. ],
[ 0. , 1.2],
[ 0. , 0. ]])
In the dia format, the inputs are stored exactly as given by spdiags (complete with the data matrix's extra values):
In [436]: M.data
Out[436]: array([[ 1.1, 1.2]])
In [437]: M.offsets
Out[437]: array([-1], dtype=int32)
As @user2357112 points out, length = min(m + offset, n - offset) is wrong, producing 3 in the test case. Changing it to length = min(m + k, n - k) makes all cases for this (4,2) matrix work, but it fails for the transpose: diags([1.1,1.2], 1, (2, 4))
The correction for this issue, as of Oct 5, is:
https://github.com/pv/scipy-work/commit/529cbde47121c8ed87f74fa6445c05d71353eb6c
length = min(m + offset, n - offset, min(m,n))
With this fix, diags([1.1,1.2], 1, (2, 4)) works.
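Until a fixed release is available, one workaround is to skip diags and build the dia_matrix directly, padding the diagonal into the row layout the DIA format expects. A sketch (diag_rect is a hypothetical helper, not a scipy function):
import numpy as np
from scipy import sparse

def diag_rect(values, offset, shape):
    # dia format: data[0, j] holds the entry at row (j - offset), column j
    m, n = shape
    data = np.zeros((1, n))
    k = max(0, offset)                   # a positive offset starts at column k
    data[0, k:k + len(values)] = values
    return sparse.dia_matrix((data, [offset]), shape=shape)

diag_rect([1.1, 1.2], -1, (4, 2)).toarray()
# array([[ 0. ,  0. ],
#        [ 1.1,  0. ],
#        [ 0. ,  1.2],
#        [ 0. ,  0. ]])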