Sparse matrix access elements - python

I have a sparse matrix A in Python. I'm going through code written by a friend of mine, and he uses A[:,i:i+1].toarray().flatten(). As far as I know, the program worked for him. However, when I try to use it, I get:
from scipy import sparse
...
diagonals = [[2] * 3, [-1] * (3-1), [-1] * (3-1)]
offsets = [0, 1, -1]
B = sparse.diags(diagonals, offsets)
A = sparse.kronsum(B,B)
...
A[:,i:i+1].toarray().flatten()
Exception:
in __getitem__
raise NotImplementedError
NotImplementedError
My question: what do I need to implement, or how can I access elements of a sparse matrix? Thanks for the help.

Most likely you have a bsr format matrix, while the code you have was written against an older version of scipy, where kronsum returned a csr or csc matrix. I don't know a good way of tracing this.
So if we run your code on scipy 1.7.2:
type(A)
scipy.sparse.bsr.bsr_matrix
We can access the elements by:
A = sparse.kronsum(B,B,format = "csr")
A[:,i:i+1].toarray().flatten()
array([-1., 4., -1., 0., -1., 0., 0., 0., 0.])
Or
A = sparse.kronsum(B,B)
A.tocsr()[:,i:i+1].toarray().flatten()
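For completeness, a minimal sketch (reusing B and i from the question) that checks the format first and then converts to a column-friendly format before slicing; csc is the cheapest format for column access:
A = sparse.kronsum(B, B)
print(A.format)                                  # 'bsr' on recent scipy versions
col = A.tocsc()[:, i:i+1].toarray().flatten()    # column slicing works on csc/csr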

Related

Converting from sparse to dense to sparse again decreases density after constructing sparse matrix

I am using scipy to generate a sparse finite difference matrix, constructing it initially from block matrices and then editing the diagonal to account for boundary conditions. The resulting sparse matrix is of the BSR type. I have found that if I convert the matrix to a dense matrix and then back to a sparse matrix using the scipy.sparse.bsr_matrix function, I am left with a sparser matrix than before. Here is the code I use to generate the matrix:
import numpy as np
import scipy as sp
import scipy.sparse

size = (4,4)
xDiff = np.zeros((size[0]+1,size[0]))
ix,jx = np.indices(xDiff.shape)
xDiff[ix==jx] = 1
xDiff[ix==jx+1] = -1
yDiff = np.zeros((size[1]+1,size[1]))
iy,jy = np.indices(yDiff.shape)
yDiff[iy==jy] = 1
yDiff[iy==jy+1] = -1
Ax = sp.sparse.dia_matrix(-np.matmul(np.transpose(xDiff),xDiff))
Ay = sp.sparse.dia_matrix(-np.matmul(np.transpose(yDiff),yDiff))
lap = sp.sparse.kron(sp.sparse.eye(size[1]),Ax) + sp.sparse.kron(Ay,sp.sparse.eye(size[0]))
#set up boundary conditions
BC_diag = np.array([2]+[1]*(size[0]-2)+[2]+([1]+[0]*(size[0]-2)+[1])*(size[1]-2)+[2]+[1]*(size[0]-2)+[2])
lap += sp.sparse.diags(BC_diag)
If I check the sparsity of this matrix I see the following:
lap
<16x16 sparse matrix of type '<class 'numpy.float64'>'
with 160 stored elements (blocksize = 4x4) in Block Sparse Row format>
However, if I convert it to a dense matrix and then back to the same sparse format I see a much sparser matrix:
sp.sparse.bsr_matrix(lap.todense())
<16x16 sparse matrix of type '<class 'numpy.float64'>'
with 64 stored elements (blocksize = 1x1) in Block Sparse Row format>
I suspect that this is happening because I constructed the matrix using the sparse.kron function. My question is whether there is a way to arrive at the smaller sparse matrix without converting to dense first, for example if I end up wanting to simulate a very large domain.
BSR stores the data in dense blocks:
In [167]: lap.data.shape
Out[167]: (10, 4, 4)
In this case those blocks have quite a few zeros.
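A quick way to see that (the count agrees with the cleaned-up matrices below):
np.count_nonzero(lap.data)   # 64 actual nonzeros out of the 160 stored values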
In [168]: lap1 = lap.tocsr()
In [170]: lap1
Out[170]:
<16x16 sparse matrix of type '<class 'numpy.float64'>'
with 160 stored elements in Compressed Sparse Row format>
In [171]: lap1.data
Out[171]:
array([-2., 1., 0., 0., 1., 0., 0., 0., 1., -3., 1., 0., 0.,
1., 0., 0., 0., 1., -3., 1., 0., 0., 1., 0., 0., 0.,
1., -2., 0., 0., 0., 1., 1., 0., 0., 0., -3., 1., 0.,
0., 1., 0., 0., 0., 0., 1., 0., 0., 1., -4., 1., 0.,
...
0., 0., 1., -2.])
In-place cleanup:
In [172]: lap1.eliminate_zeros()
In [173]: lap1
Out[173]:
<16x16 sparse matrix of type '<class 'numpy.float64'>'
with 64 stored elements in Compressed Sparse Row format>
If I specify the csr format when using kron:
In [181]: lap2 = sparse.kron(np.eye(size[1]), Ax, format='csr') + sparse.kron(Ay, np.eye(size[0]), format='csr')
In [182]: lap2
Out[182]:
<16x16 sparse matrix of type '<class 'numpy.float64'>'
with 64 stored elements in Compressed Sparse Row format>
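If you still want a BSR matrix at the end, a small sketch (assuming lap from the question) that avoids the dense detour entirely:
lap_clean = lap.tocsr()
lap_clean.eliminate_zeros()                    # drop the explicitly stored zeros
lap_bsr = lap_clean.tobsr(blocksize=(1, 1))    # 64 stored elements, no densification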
[I have been informed that my answer is incorrect. The reason, if I understand correctly, is that Scipy does not use Lapack for creating matrices but uses its own code for this purpose. Interesting. The information, though unexpected, has the ring of authority. I shall defer to it!]
[I will leave the answer posted for reference, but I no longer assert that it is correct.]
Generally speaking, when it comes to complicated data structures like sparse matrices, you have two cases:
the constructor knows the structure's full contents in advance; or
the structure is designed to be built up gradually so that the structure's full contents are known only after the structure is complete.
The classic case of a complicated data structure is the binary tree. You can make a binary tree more efficient by copying it after it is complete. Otherwise, the standard red-black implementation of the tree leaves some search paths up to twice as long as others—which is usually okay but is not optimal.
Now, you probably knew all that, but I mention it for a reason. Scipy depends on Lapack. Lapack brings several different storage schemes. Two of these are the
general sparse and
banded
schemes. It would appear that Scipy begins by storing your matrix as sparse, where the indices of each nonzero element are explicitly stored; but that, on copy, Scipy notices that the banded representation is the more appropriate—for your matrix is, after all, banded.

Normalization of a matrix

I have a 150x4 matrix X which I created from a pandas dataframe using the following code:
X = df_new.as_matrix()
I have to normalize it using this function:
I know that μ_j is the mean value of j and that σ_j is the standard deviation of j, but I don't understand what j is. I'm having a little trouble understanding what the bar on X is, and I'm confused by the commas in the equation (I don't know if they have any significance or not).
Can anyone help me understand what this equation means so I can then write the normalization using sklearn?
You don't actually need to write code for the normalization yourself - it comes ready with sklearn.preprocessing.scale.
Here is an example from the docs:
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X_train = np.array([[ 1., -1., 2.],
... [ 2., 0., 0.],
... [ 0., 1., -1.]])
>>> X_scaled = preprocessing.scale(X_train)
>>> X_scaled
array([[ 0. ..., -1.22..., 1.33...],
[ 1.22..., 0. ..., -0.26...],
[-1.22..., 1.22..., -1.06...]])
When used with the default setting axis=0, the normalization happens column-wise (i.e. for each column j, as in your equation). As a result, it is easy to confirm that the scaled data has zero mean and unit variance:
>>> X_scaled.mean(axis=0)
array([ 0., 0., 0.])
>>> X_scaled.std(axis=0)
array([ 1., 1., 1.])
The indexes for matrix X are row (i) and column (j). Hence, X,j means column j of matrix X. I.e. normalize each column of matrix X to z-scores.
You can do that using pandas:
df_new_zscores = (df_new - df_new.mean()) / df_new.std()
I do not know pandas, but I think that the equation means that the normalized matrix is given by
X̄_{,j} = (X_{,j} − μ_j) / σ_j
You subtract the empirical mean and divide by the empirical standard deviation per column.
You sometimes use this for Principal Component Analysis.
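For reference, a short numpy sketch of the same column-wise normalization (assuming df_new is the DataFrame from the question; note numpy's std uses ddof=0 by default, while pandas uses ddof=1):
import numpy as np

X = df_new.to_numpy()        # as_matrix() is deprecated in newer pandas
mu = X.mean(axis=0)          # per-column means (mu_j)
sigma = X.std(axis=0)        # per-column standard deviations (sigma_j)
X_bar = (X - mu) / sigma     # column-wise z-scores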

fill off diagonal of numpy array fails

I'm trying to fill the offset diagonals of a matrix:
loss_matrix = np.zeros((125,125))
np.diagonal(loss_matrix, 3).fill(4)
ValueError: assignment destination is read-only
Two questions:
1) Without iterating over indexes, how can I set the offset diagonals of a numpy array?
2) Why is the result of np.diagonal read only? The documentation for numpy.diagonal reads: "In NumPy 1.10, it will return a read/write view and writing to the returned array will alter your original array."
np.__version__
'1.10.1'
Judging by the discussion on the NumPy issue tracker, it looks like the feature is stuck in limbo and they never got around to fixing the documentation to say it was delayed.
If you need writability, you can force it. This will only work on NumPy 1.9 and up, since np.diagonal makes a copy on lower versions:
diag = np.diagonal(loss_matrix, 3)
# It's not writable. MAKE it writable.
diag.setflags(write=True)
diag.fill(4)
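A quick sanity check (a sketch) that the write really went through to the original array:
assert (loss_matrix.diagonal(3) == 4).all()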
In an older version, diagflat constructs an array from a diagonal.
In [180]: M=np.diagflat(np.ones(125-3)*4,3)
In [181]: M.shape
Out[181]: (125, 125)
In [182]: M.diagonal(3)
Out[182]:
array([ 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4.,... 4.])
In [183]: np.__version__
Out[183]: '1.8.2'
Effectively it does this (working from its Python code)
res = np.zeros((125, 125))
i = np.arange(122)
fi = i+3+i*125
res.flat[fi] = 4
That is, it finds the flattened-array equivalents of the diagonal indices.
I can also get fi with:
In [205]: i=np.arange(0,122)
In [206]: np.ravel_multi_index((i,i+3),(125,125))
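If you just want to write an offset diagonal and don't need np.diagonal at all, plain integer indexing also works (a sketch for the 125x125, offset-3 case):
import numpy as np

loss_matrix = np.zeros((125, 125))
i = np.arange(125 - 3)
loss_matrix[i, i + 3] = 4     # set the k=3 offset diagonal directly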

Eigenvectors computed with numpy's eigh and svd do not match

Consider the singular value decomposition M = USV*. Then the eigenvalue decomposition of M*M gives M*M = VS*U*USV* = V(S*S)V*. I wish to verify this equality with numpy by showing that the eigenvectors returned by the eigh function are the same as those returned by the svd function:
import numpy as np
np.random.seed(42)
# create mean centered data
A=np.random.randn(50,20)
M= A-np.array(A.mean(0),ndmin=2)
# svd
U1,S1,V1=np.linalg.svd(M)
S1=np.square(S1)
V1=V1.T
# eig
S2,V2=np.linalg.eigh(np.dot(M.T,M))
indx=np.argsort(S2)[::-1]
S2=S2[indx]
V2=V2[:,indx]
# both Vs are in orthonormal form
assert np.all(np.isclose(np.linalg.norm(V1,axis=1), np.ones(V1.shape[0])))
assert np.all(np.isclose(np.linalg.norm(V1,axis=0), np.ones(V1.shape[1])))
assert np.all(np.isclose(np.linalg.norm(V2,axis=1), np.ones(V2.shape[0])))
assert np.all(np.isclose(np.linalg.norm(V2,axis=0), np.ones(V2.shape[1])))
assert np.all(np.isclose(S1,S2))
assert np.all(np.isclose(V1,V2))
The last assertion fails. Why?
Just play with small numbers to debug your problem.
Start with A=np.random.randn(3,2) instead of your much larger matrix of size (50,20).
In my random case, I find that
v1 = array([[-0.33872745, 0.94088454],
[-0.94088454, -0.33872745]])
and for v2:
v2 = array([[ 0.33872745, -0.94088454],
[ 0.94088454, 0.33872745]])
they only differ by a sign, and obviously, even if normalized to have unit norm, an eigenvector is only determined up to a sign.
Now if you try the trick
assert np.all(np.isclose(V1,-1*V2))
for your original big matrix, it fails... again, this is OK. What happens is that some vectors have been multiplied by -1, some others haven't.
A correct way to check for equality between the vectors is:
assert np.allclose(np.abs((V1*V2).sum(0)), 1.)
and indeed, to get a feeling of how this works you can print this quantity:
(V1*V2).sum(0)
that indeed is either +1 or -1 depending on the vector:
array([ 1., -1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., -1., 1., 1., 1., -1., -1.])
EDIT: This will happen in most cases, especially if starting from a random matrix. Notice however that this test will likely fail if one or more eigenvalues has an eigenspace of dimension larger than 1, as pointed out by @Sven Marnach in his comment below:
There might be other differences than just vectors multiplied by -1.
If any of the eigenvalues has a multi-dimensional eigenspace, you
might get an arbitrary orthonormal basis of that eigenspace, and
two such bases might be rotated against each other by an arbitrary
unitary matrix
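A sketch that fixes the signs explicitly before comparing (using the V1, V2 computed above, and assuming no repeated eigenvalues):
signs = np.sign((V1 * V2).sum(axis=0))   # +1 or -1 per eigenvector pair
assert np.all(np.isclose(V1, V2 * signs))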

scipy.linalg.norm different from sklearn.preprocessing.normalize?

from numpy.random import rand
from sklearn.preprocessing import normalize
from scipy.sparse import csr_matrix
from scipy.linalg import norm
w = (rand(1,10)<0.25)*rand(1,10)
x = (rand(1,10)<0.25)*rand(1,10)
w_csr = csr_matrix(w)
x_csr = csr_matrix(x)
(normalize(w_csr,axis=1,copy=False,norm='l2')*normalize(x_csr,axis=1,copy=False,norm='l2')).todense()
norm(w,ord='fro')*norm(x,ord='fro')
I am working with scipy csr_matrix and would like to normalize two matrices using the Frobenius norm and get their product. But norm from scipy.linalg and normalize from sklearn.preprocessing seem to evaluate the matrices differently. Since technically in the above two cases I am calculating the same Frobenius norm, shouldn't the two expressions evaluate to the same thing? But I get the following answer:
matrix([[ 0.962341]])
0.4431811178371029
for sklearn.preprocessing and scipy.linalg.norm respectively. I am really interested to know what I am doing wrong.
sklearn.preprocessing.normalize divides each row by its norm. It returns a matrix with the same shape as its input. scipy.linalg.norm returns the norm of the matrix. So your calculations are not equivalent.
Note that your code is not correct as it is written. This line
(normalize(w_csr,axis=1,copy=False,norm='l2')*normalize(x_csr,axis=1,copy=False,norm='l2')).todense()
raises ValueError: dimension mismatch. The two calls to normalize both return matrices with shapes (1, 10), so their dimensions are not compatible for a matrix product. What did you do to get matrix([[ 0.962341]])?
Here's a simple function to compute the Frobenius norm of a sparse (e.g. CSR or CSC) matrix:
def spnorm(a):
    # Frobenius norm of a sparse matrix: root of the sum of squared stored values
    return np.sqrt((a.data**2).sum())
For example,
In [182]: b_csr
Out[182]:
<3x5 sparse matrix of type '<type 'numpy.float64'>'
with 5 stored elements in Compressed Sparse Row format>
In [183]: b_csr.A
Out[183]:
array([[ 1., 0., 0., 0., 0.],
[ 0., 2., 0., 4., 0.],
[ 0., 0., 0., 2., 1.]])
In [184]: spnorm(b_csr)
Out[184]: 5.0990195135927845
In [185]: norm(b_csr.A)
Out[185]: 5.0990195135927845
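For a single 1x10 row, dividing by the l2 row norm (what sklearn's normalize does) and dividing by the Frobenius norm coincide, so an apples-to-apples comparison looks like this sketch (reusing w, x, w_csr, x_csr from the question, with numpy imported as np; note the transpose needed to make the product a scalar):
# inner product of the two l2-normalized rows (sklearn path)
a = (normalize(w_csr, norm='l2') * normalize(x_csr, norm='l2').T).toarray()[0, 0]
# raw inner product divided by the product of Frobenius norms (scipy path)
b = (w @ x.T).item() / (norm(w, ord='fro') * norm(x, ord='fro'))
np.isclose(a, b)    # True, provided neither row is all zeros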
