scipy.linalg.norm different from sklearn.preprocessing.normalize?

from numpy.random import rand
from sklearn.preprocessing import normalize
from scipy.sparse import csr_matrix
from scipy.linalg import norm
w = (rand(1,10)<0.25)*rand(1,10)
x = (rand(1,10)<0.25)*rand(1,10)
w_csr = csr_matrix(w)
x_csr = csr_matrix(x)
(normalize(w_csr,axis=1,copy=False,norm='l2')*normalize(x_csr,axis=1,copy=False,norm='l2')).todense()
norm(w,ord='fro')*norm(x,ord='fro')
I am working with scipy csr_matrix and would like to normalize two matrices using the Frobenius norm and then take their product. But norm from scipy.linalg and normalize from sklearn.preprocessing seem to evaluate the matrices differently. Since technically I am calculating the same Frobenius norm in both cases, shouldn't the two expressions evaluate to the same thing? Instead I get the following answers:
matrix([[ 0.962341]])
0.4431811178371029
for sklearn.preprocessing and scipy.linalg.norm respectively. I am really interested to know what I am doing wrong.

sklearn.preprocessing.normalize divides each row by its norm. It returns a matrix with the same shape as its input. scipy.linalg.norm returns a single scalar, the norm of the whole matrix. So your calculations are not equivalent.
Note that your code is not correct as it is written. This line
(normalize(w_csr,axis=1,copy=False,norm='l2')*normalize(x_csr,axis=1,copy=False,norm='l2')).todense()
raises ValueError: dimension mismatch. The two calls to normalize both return matrices with shapes (1, 10), so their dimensions are not compatible for a matrix product. What did you do to get matrix([[ 0.962341]])?
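To make the difference concrete, here is a minimal sketch (using the same (1, 10) inputs as in the question, and assuming neither vector is all zeros) of what each call actually returns:
import numpy as np
from numpy.random import rand
from scipy.sparse import csr_matrix
from scipy.linalg import norm
from sklearn.preprocessing import normalize

w = (rand(1, 10) < 0.25) * rand(1, 10)
x = (rand(1, 10) < 0.25) * rand(1, 10)
w_csr, x_csr = csr_matrix(w), csr_matrix(x)

# normalize() keeps the shape: each row is divided by its own L2 norm
wn = normalize(w_csr, axis=1, norm='l2')    # still shape (1, 10)
xn = normalize(x_csr, axis=1, norm='l2')    # still shape (1, 10)

# norm() collapses the whole matrix to a single scalar (Frobenius for 2-D input)
fw, fx = norm(w, ord='fro'), norm(x, ord='fro')

# multiplying the two normalized row vectors requires a transpose; the result
# is their cosine similarity, which is not the product of the Frobenius norms
print((wn @ xn.T).todense())     # a 1x1 matrix
print((w / fw) @ (x / fx).T)     # same value, computed densely
print(fw * fx)                   # a different quantity altogether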
Here's a simple function to compute the Frobenius norm of a sparse (e.g. CSR or CSC) matrix:
import numpy as np

def spnorm(a):
    # square root of the sum of squares of the stored (nonzero) values
    return np.sqrt((a.data**2).sum())
For example,
In [182]: b_csr
Out[182]:
<3x5 sparse matrix of type '<type 'numpy.float64'>'
with 5 stored elements in Compressed Sparse Row format>
In [183]: b_csr.A
Out[183]:
array([[ 1., 0., 0., 0., 0.],
[ 0., 2., 0., 4., 0.],
[ 0., 0., 0., 2., 1.]])
In [184]: spnorm(b_csr)
Out[184]: 5.0990195135927845
In [185]: norm(b_csr.A)
Out[185]: 5.0990195135927845
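Alternatively, recent SciPy versions ship scipy.sparse.linalg.norm, which computes the Frobenius norm of a sparse matrix directly (worth checking that your installed SciPy has it):
In [186]: from scipy.sparse.linalg import norm as sparse_norm
In [187]: sparse_norm(b_csr)    # Frobenius norm by default
Out[187]: 5.0990195135927845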

Related

numpy functions that can/cannot be used on a compressed sparse row (CSR) matrix

I am a newbie in Python and I have a (probably very naive) question. I have a CSR (compressed sparse row) matrix to work on (let's name it M), and it looks like some functions that are designed for 2d numpy array manipulation work on my matrix while some others do not.
For example, numpy.sum(M, axis=0) works fine, while numpy.diagonal(M) gives an error: ValueError: diag requires an array of at least two dimensions.
So is there a rationale behind why one matrix function works on M while the other does not?
And a bonus question is, how to get the diagonal elements from a CSR matrix given the above numpy.diagonal does not work for it?
The code for np.diagonal is:
return asanyarray(a).diagonal(offset=offset, axis1=axis1, axis2=axis2)
That is, it first tries to turn the argument into an array, for example if it is a list of lists. But that isn't the right way to turn a sparse matrix into a ndarray.
In [33]: from scipy import sparse
In [34]: M = sparse.csr_matrix(np.eye(3))
In [35]: M
Out[35]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in Compressed Sparse Row format>
In [36]: M.A # right
Out[36]:
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
In [37]: np.asanyarray(M) # wrong
Out[37]:
array(<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in Compressed Sparse Row format>, dtype=object)
The correct way to use np.diagonal is:
In [38]: np.diagonal(M.A)
Out[38]: array([1., 1., 1.])
But no need for that. M already has a diagonal method:
In [39]: M.diagonal()
Out[39]: array([1., 1., 1.])
np.sum does work, because it delegates the action to a method (look at its code):
In [40]: M.sum(axis=0)
Out[40]: matrix([[1., 1., 1.]])
In [41]: np.sum(M, axis=0)
Out[41]: matrix([[1., 1., 1.]])
As a general rule, try to use sparse functions and methods on sparse matrices. Don't count on numpy functions to work right. sparse is built on numpy, but numpy does not 'know' about sparse.
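As a rough illustration of that rule, here is a small sketch (reusing the M from above) that stays with sparse-aware methods and helpers instead of plain numpy functions:
import numpy as np
from scipy import sparse

M = sparse.csr_matrix(np.eye(3))

# methods defined on the sparse matrix itself are safe
print(M.diagonal())      # array([1., 1., 1.])
print(M.sum(axis=0))     # matrix([[1., 1., 1.]])
print(M.max())           # 1.0

# helpers in scipy.sparse also understand the sparse layout, e.g. find()
rows, cols, vals = sparse.find(M)
print(rows, cols, vals)  # [0 1 2] [0 1 2] [1. 1. 1.]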

Converting from sparse to dense to sparse again decreases density after constructing sparse matrix

I am using scipy to generate a sparse finite difference matrix, constructing it initially from block matrices and then editing the diagonal to account for boundary conditions. The resulting sparse matrix is of the BSR type. I have found that if I convert the matrix to a dense matrix and then back to a sparse matrix using the scipy.sparse.bsr_matrix function, I am left with a matrix that stores far fewer elements than before. Here is the code I use to generate the matrix:
import numpy as np
import scipy as sp
import scipy.sparse

size = (4,4)
xDiff = np.zeros((size[0]+1,size[0]))
ix,jx = np.indices(xDiff.shape)
xDiff[ix==jx] = 1
xDiff[ix==jx+1] = -1
yDiff = np.zeros((size[1]+1,size[1]))
iy,jy = np.indices(yDiff.shape)
yDiff[iy==jy] = 1
yDiff[iy==jy+1] = -1
Ax = sp.sparse.dia_matrix(-np.matmul(np.transpose(xDiff),xDiff))
Ay = sp.sparse.dia_matrix(-np.matmul(np.transpose(yDiff),yDiff))
lap = sp.sparse.kron(sp.sparse.eye(size[1]),Ax) + sp.sparse.kron(Ay,sp.sparse.eye(size[0]))
#set up boundary conditions
BC_diag = np.array([2]+[1]*(size[0]-2)+[2]+([1]+[0]*(size[0]-2)+[1])*(size[1]-2)+[2]+[1]*(size[0]-2)+[2])
lap += sp.sparse.diags(BC_diag)
If I check the sparsity of this matrix I see the following:
lap
<16x16 sparse matrix of type '<class 'numpy.float64'>'
with 160 stored elements (blocksize = 4x4) in Block Sparse Row format>
However, if I convert it to a dense matrix and then back to the same sparse format I see a much sparser matrix:
sp.sparse.bsr_matrix(lap.todense())
<16x16 sparse matrix of type '<class 'numpy.float64'>'
with 64 stored elements (blocksize = 1x1) in Block Sparse Row format>
I suspect this is happening because I constructed the matrix using the sparse.kron function. My question is whether there is a way to arrive at the smaller sparse matrix without converting to dense first, for example if I end up wanting to simulate a very large domain.
BSR stores the data in dense blocks:
In [167]: lap.data.shape
Out[167]: (10, 4, 4)
In this case those blocks have quite a few zeros.
In [168]: lap1 = lap.tocsr()
In [170]: lap1
Out[170]:
<16x16 sparse matrix of type '<class 'numpy.float64'>'
with 160 stored elements in Compressed Sparse Row format>
In [171]: lap1.data
Out[171]:
array([-2., 1., 0., 0., 1., 0., 0., 0., 1., -3., 1., 0., 0.,
1., 0., 0., 0., 1., -3., 1., 0., 0., 1., 0., 0., 0.,
1., -2., 0., 0., 0., 1., 1., 0., 0., 0., -3., 1., 0.,
0., 1., 0., 0., 0., 0., 1., 0., 0., 1., -4., 1., 0.,
...
0., 0., 1., -2.])
In place cleanup:
In [172]: lap1.eliminate_zeros()
In [173]: lap1
Out[173]:
<16x16 sparse matrix of type '<class 'numpy.float64'>'
with 64 stored elements in Compressed Sparse Row format>
If I specify the csr format when using kron:
In [181]: lap2 = sparse.kron(np.eye(size[1]), Ax, format='csr') + sparse.kron(Ay, np.eye(size[0]), format='csr')
In [182]: lap2
Out[182]:
<16x16 sparse matrix of type '<class 'numpy.float64'>'
with 64 stored elements in Compressed Sparse Row format>
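A quick way to confirm that the extra stored elements are just explicit zeros inside the dense blocks is to compare .nnz (stored entries) with .count_nonzero() (entries that are actually nonzero); a small sketch using the lap matrix from the question:
print(lap.nnz)              # 160 -- BSR stores whole 4x4 blocks, zeros included
print(lap.count_nonzero())  # 64  -- values that are genuinely nonzero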
[I have been informed that my answer is incorrect. The reason, if I understand correctly, is that Scipy does not use Lapack for creating these matrices but uses its own code for this purpose. Interesting. The information, though unexpected, has the ring of authority. I shall defer to it!
I will leave the answer posted for reference, but I no longer assert that it is correct.]
Generally speaking, when it comes to complicated data structures like sparse matrices, you have two cases:
the constructor knows the structure's full contents in advance; or
the structure is designed to be built up gradually so that the structure's full contents are known only after the structure is complete.
The classic case of a complicated data structure is the binary tree. You can make a binary tree more efficient by copying it after it is complete. Otherwise, the standard red-black implementation of the tree leaves some search paths up to twice as long as others, which is usually okay but is not optimal.
Now, you probably knew all that, but I mention it for a reason. Scipy depends on Lapack. Lapack brings several different storage schemes. Two of these are the
general sparse and
banded
schemes. It would appear that Scipy begins by storing your matrix as sparse, where the indices of each nonzero element are explicitly stored; but that, on copy, Scipy notices that the banded representation is the more appropriate—for your matrix is, after all, banded.

Normalization of a matrix

I have a 150x4 matrix X which I created from a pandas dataframe using the following code:
X = df_new.as_matrix()
I have to normalize it using this function:
X̄_{,j} = (X_{,j} − μ_j) / σ_j
I know that μ_j is the mean value of j, and that σ_j is the standard deviation of j, but I don't understand what j is. I'm having a little trouble understanding what the bar on X is, and I'm confused by the commas in the equation (I don't know whether they have any significance).
Can anyone help me understand what this equation means so I can then write the normalization using sklearn?
You don't actually need to write code for the normalization yourself - it comes ready with sklearn.preprocessing.scale.
Here is an example from the docs:
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X_train = np.array([[ 1., -1., 2.],
... [ 2., 0., 0.],
... [ 0., 1., -1.]])
>>> X_scaled = preprocessing.scale(X_train)
>>> X_scaled
array([[ 0. ..., -1.22..., 1.33...],
[ 1.22..., 0. ..., -0.26...],
[-1.22..., 1.22..., -1.06...]])
When used with the default setting axis=0, the normalization happens column-wise (i.e. for each column j, as in your equation). As a result, it is easy to confirm that the scaled data has zero mean and unit variance:
>>> X_scaled.mean(axis=0)
array([ 0., 0., 0.])
>>> X_scaled.std(axis=0)
array([ 1., 1., 1.])
The indexes for matrix X are row (i) and column (j). Hence, X_{,j} means column j of matrix X. I.e. normalize each column of matrix X to z-scores.
You can do that using pandas:
df_new_zscores = (df_new - df_new.mean()) / df_new.std()
I do not know pandas, but I think the equation means that the normalized matrix is obtained by subtracting the empirical mean and dividing by the empirical standard deviation of each column.
This kind of standardization is sometimes used before Principal Component Analysis.
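One detail worth noting if you compare the sklearn and pandas approaches (a hedged side note, not from the answers above): pandas' DataFrame.std() uses the sample standard deviation (ddof=1) by default, while sklearn.preprocessing.scale divides by the population standard deviation (ddof=0), so the two results differ by a small constant factor per column:
import numpy as np
import pandas as pd
from sklearn import preprocessing

df_new = pd.DataFrame(np.random.randn(150, 4))   # stand-in for the real data

scaled_sklearn = preprocessing.scale(df_new.values)              # std with ddof=0
scaled_pandas = (df_new - df_new.mean()) / df_new.std()          # std with ddof=1
scaled_pandas_0 = (df_new - df_new.mean()) / df_new.std(ddof=0)

print(np.allclose(scaled_sklearn, scaled_pandas))    # False (slightly different scale)
print(np.allclose(scaled_sklearn, scaled_pandas_0))  # True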

Normalise 2D Numpy Array: Zero Mean Unit Variance

I have a 2D Numpy array in which I want to normalise each column to zero mean and unit variance. Since I'm primarily used to C++, my current approach is to use loops to iterate over the elements in a column and do the necessary operations, then repeat this for all columns. I wanted to know about a pythonic way to do so.
Let class_input_data be my 2D array. I can get the column mean as:
column_mean = numpy.sum(class_input_data, axis = 0)/class_input_data.shape[0]
I then subtract the mean from all columns by:
class_input_data = class_input_data - column_mean
By now, the data should be zero mean. However, the value of:
numpy.sum(class_input_data, axis = 0)
isn't equal to 0, implying that I have done something wrong in my normalisation. By isn't equal to 0, I don't mean very small numbers which can be attributed to floating point inaccuracies.
Something like:
import numpy as np
eg_array = 5 + (np.random.randn(10, 10) * 2)
normed = (eg_array - eg_array.mean(axis=0)) / eg_array.std(axis=0)
normed.mean(axis=0)
Out[14]:
array([ 1.16573418e-16, -7.77156117e-17, -1.77635684e-16,
9.43689571e-17, -2.22044605e-17, -6.09234885e-16,
-2.22044605e-16, -4.44089210e-17, -7.10542736e-16,
4.21884749e-16])
normed.std(axis=0)
Out[15]: array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

Eigenvectors computed with numpy's eigh and svd do not match

Consider the singular value decomposition M = USV*. Then the eigenvalue decomposition of M*M gives M*M = VS*U*USV* = V(S*S)V*. I wish to verify this equality with numpy by showing that the eigenvectors returned by the eigh function are the same as those returned by the svd function:
import numpy as np
np.random.seed(42)
# create mean centered data
A=np.random.randn(50,20)
M= A-np.array(A.mean(0),ndmin=2)
# svd
U1,S1,V1=np.linalg.svd(M)
S1=np.square(S1)
V1=V1.T
# eig
S2,V2=np.linalg.eigh(np.dot(M.T,M))
indx=np.argsort(S2)[::-1]
S2=S2[indx]
V2=V2[:,indx]
# both Vs are in orthonormal form
assert np.all(np.isclose(np.linalg.norm(V1,axis=1), np.ones(V1.shape[0])))
assert np.all(np.isclose(np.linalg.norm(V1,axis=0), np.ones(V1.shape[1])))
assert np.all(np.isclose(np.linalg.norm(V2,axis=1), np.ones(V2.shape[0])))
assert np.all(np.isclose(np.linalg.norm(V2,axis=0), np.ones(V2.shape[1])))
assert np.all(np.isclose(S1,S2))
assert np.all(np.isclose(V1,V2))
The last assertion fails. Why?
Just play with small numbers to debug your problem.
Start with A=np.random.randn(3,2) instead of your much larger matrix with size (50,20)
In my random case, I find that
v1 = array([[-0.33872745, 0.94088454],
[-0.94088454, -0.33872745]])
and for v2:
v2 = array([[ 0.33872745, -0.94088454],
[ 0.94088454, 0.33872745]])
they only differ by a sign, and obviously, even when normalized to unit norm, two eigenvectors can still differ by a sign.
Now if you try the trick
assert np.all(np.isclose(V1,-1*V2))
for your original big matrix, it fails... again, this is OK. What happens is that some vectors have been multiplied by -1, some others haven't.
A correct way to check for equality between the vectors is:
assert np.allclose(np.abs((V1*V2).sum(0)), 1.)
and indeed, to get a feeling of how this works you can print this quantity:
(V1*V2).sum(0)
that indeed is either +1 or -1 depending on the vector:
array([ 1., -1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., -1., 1., 1., 1., -1., -1.])
EDIT: This will happen in most cases, especially when starting from a random matrix. Notice, however, that this test will likely fail if one or more eigenvalues has an eigenspace of dimension larger than 1, as pointed out by @Sven Marnach in his comment below:
There might be other differences than just vectors multiplied by -1.
If any of the eigenvalues has a multi-dimensional eigenspace, you
might get an arbitrary orthonormal basis of that eigenspace, and
two such bases might be rotated against each other by an arbitrary
unitary matrix
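If you do want the final assertion on the full matrices to pass, one option is to first flip the sign of each column of V2 to match the corresponding column of V1. This is only a sketch and assumes all eigenvalues are distinct, so the ambiguity really is just a sign:
# align the sign of each eigenvector in V2 with the matching column of V1
signs = np.sign((V1 * V2).sum(axis=0))   # +1 or -1 per column
V2_aligned = V2 * signs

assert np.all(np.isclose(V1, V2_aligned))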
