Normalization of a matrix - python

I have a 150x4 matrix X which I created from a pandas dataframe using the following code:
X = df_new.as_matrix()
I have to normalize it using this function:
I know that Uj is the mean val of j, and that σ j is the standard deviation of j, but I don't understand what j is. I'm having a little trouble understanding what the bar on X is, and I'm confused by the commas in the equation (I don't know if they have any significance or not).
Can anyone help me understand what this equation means so I can then write the normalization using sklearn?

You don't actually need to write code for the normalization yourself - it comes ready with sklearn.preprocessing.scale.
Here is an example from the docs:
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X_train = np.array([[ 1., -1., 2.],
... [ 2., 0., 0.],
... [ 0., 1., -1.]])
>>> X_scaled = preprocessing.scale(X_train)
>>> X_scaled
array([[ 0. ..., -1.22..., 1.33...],
[ 1.22..., 0. ..., -0.26...],
[-1.22..., 1.22..., -1.06...]])
When used with the default setting axis=0, the normalization happens column-wise (i.e. for each column j, as in your equation). As a result, it is easy to confirm that the scaled data has zero mean and unit variance:
>>> X_scaled.mean(axis=0)
array([ 0., 0., 0.])
>>> X_scaled.std(axis=0)
array([ 1., 1., 1.])

The indexes for matrix X are row (i) and column (j). Hence X,j means column j of matrix X; that is, the equation normalizes each column of X to z-scores.
You can do that using pandas:
df_new_zscores = (df_new - df_new.mean()) / df_new.std()
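One detail worth checking when comparing this with the sklearn result: pandas' std() defaults to the sample standard deviation (ddof=1), while sklearn.preprocessing.scale divides by the population standard deviation (ddof=0). The sketch below uses plain NumPy on the small array from the docs example (not the asker's data) to show the difference; the variable names are only illustrative.

```python
import numpy as np

# Small example array (taken from the sklearn docs example, not the asker's data)
X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

col_mean = X.mean(axis=0)

# Population standard deviation (ddof=0): what sklearn's scale uses
z_pop = (X - col_mean) / X.std(axis=0)

# Sample standard deviation (ddof=1): what pandas' DataFrame.std() uses by default
z_sample = (X - col_mean) / X.std(axis=0, ddof=1)

print(z_pop.std(axis=0))     # unit population std per column
print(z_sample.std(axis=0))  # slightly below 1 when measured with ddof=0
```

With pandas, passing ddof=0 to std() should give the same scaling as sklearn; for 150 rows the two conventions differ only slightly.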

I do not know pandas, but I think the equation means the following: for each column, you subtract the empirical mean and divide by the empirical standard deviation.
You sometimes use this for Principal Component Analysis.

Related

implementing euclidean distance based formula using numpy

I am trying to implement this formula in python using numpy
As you can see in the picture above, X is a numpy matrix in which each xi is a vector with n dimensions, and C is also a numpy matrix in which each Ci is a vector with n dimensions. dist(Ci,xi) is the Euclidean distance between these two vectors.
I implemented it in Python:
value = 0
for i in range(X.shape[0]):
    min_value = math.inf
    # this for loop iterates k times
    for j in range(C.shape[0]):
        distance = (np.dot(X[i] - C[j],
                           X[i] - C[j])) ** .5
        min_value = min(min_value, distance)
    value += min_value
fitnessValue = value
But my code's performance is not good enough. Is there a faster way to calculate this formula in Python? Any ideas would be appreciated.
Generally, loops that run a large number of times should be avoided in Python when possible.
Here, there exists a scipy function, scipy.spatial.distance.cdist(C, X), which computes the pairwise distance matrix between C and X. That is to say, if you call distance_matrix = scipy.spatial.distance.cdist(C, X), you have distance_matrix[i, j] = dist(C_i, X_j).
Then, for each j, you want to compute the minimum of dist(C_i, X_j) over all i. You do not need a loop for this either! The function numpy.min does it for you if you pass an axis argument (note that numpy.minimum is a different function, which compares two arrays element-wise).
Finally, the summation of all these minima is done by calling the numpy.sum function.
This gives code that is much more readable and faster:
import scipy.spatial.distance
import numpy as np
def your_function(C, X):
    distance_matrix = scipy.spatial.distance.cdist(C, X)
    minimum = np.min(distance_matrix, axis=0)
    return np.sum(minimum)
Which returns the same results as your function :)
Hope this helps!
einsum can also be called into play. Here is a simple small example of a pairwise distance calculation for a small set. Useful if you don't have scipy installed and/or wish to use numpy solely.
>>> a
array([[ 0., 0.],
[ 1., 1.],
[ 2., 2.],
[ 3., 3.],
[ 4., 4.]])
>>> b = a.reshape(np.prod(a.shape[:-1]),1,a.shape[-1])
>>> b
array([[[ 0., 0.]],
[[ 1., 1.]],
[[ 2., 2.]],
[[ 3., 3.]],
[[ 4., 4.]]])
>>> diff = a - b; dist_arr = np.sqrt(np.einsum('ijk,ijk->ij', diff, diff)).squeeze()
>>> dist_arr
array([[ 0. , 1.41421, 2.82843, 4.24264, 5.65685],
[ 1.41421, 0. , 1.41421, 2.82843, 4.24264],
[ 2.82843, 1.41421, 0. , 1.41421, 2.82843],
[ 4.24264, 2.82843, 1.41421, 0. , 1.41421],
[ 5.65685, 4.24264, 2.82843, 1.41421, 0. ]])
Array 'a' is a simple 2D array with shape (5, 2); 'b' is just 'a' reshaped to (5, 1, 2) so that broadcasting produces all the pairwise differences needed for a cdist-style array. The terms are written verbosely since they are extracted from other code. The 'diff' variable is the difference array, and the dist_arr shown holds the 'euclidean' distances. Should you need the squared Euclidean distance for 'closest' determinations, simply remove the np.sqrt call. Finally, squeeze just removes any length-1 dimensions from the shape.
cdist is faster for much larger arrays (on the order of thousands of origins and destinations), but einsum is a nice alternative and is well documented by others on this site.
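To tie the einsum approach back to the original question, here is a minimal NumPy-only sketch of the sum of minimum distances. The function name sum_min_distances and the small test arrays are made up for illustration; it assumes X has shape (n_points, n_dims) and C has shape (n_centers, n_dims).

```python
import numpy as np

def sum_min_distances(X, C):
    """Sum over the rows of X of the Euclidean distance to the nearest row of C."""
    # diff[i, j, :] = X[i] - C[j], shape (n_points, n_centers, n_dims)
    diff = X[:, np.newaxis, :] - C[np.newaxis, :, :]
    # Squared Euclidean distances, shape (n_points, n_centers)
    sq_dist = np.einsum('ijk,ijk->ij', diff, diff)
    # Take the minimum over centers first, so sqrt is applied to n_points values only
    return np.sqrt(sq_dist.min(axis=1)).sum()

# Hypothetical example data
X = np.array([[0., 0.], [3., 4.], [1., 1.]])
C = np.array([[0., 0.], [3., 3.]])
total = sum_min_distances(X, C)
print(total)
```

The intermediate diff array needs n_points * n_centers * n_dims floats, so for large inputs cdist or processing X in chunks is likely the safer choice.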

Normalise 2D Numpy Array: Zero Mean Unit Variance

I have a 2D Numpy array, in which I want to normalise each column to zero mean and unit variance. Since I'm primarily used to C++, the method in which I'm doing is to use loops to iterate over elements in a column and do the necessary operations, followed by repeating this for all columns. I wanted to know about a pythonic way to do so.
Let class_input_data be my 2D array. I can get the column mean as:
column_mean = numpy.sum(class_input_data, axis = 0)/class_input_data.shape[0]
I then subtract the mean from all columns by:
class_input_data = class_input_data - column_mean
By now, the data should be zero mean. However, the value of:
numpy.sum(class_input_data, axis = 0)
isn't equal to 0, implying that I have done something wrong in my normalisation. By isn't equal to 0, I don't mean very small numbers which can be attributed to floating point inaccuracies.
Something like:
import numpy as np
eg_array = 5 + (np.random.randn(10, 10) * 2)
normed = (eg_array - eg_array.mean(axis=0)) / eg_array.std(axis=0)
normed.mean(axis=0)
Out[14]:
array([ 1.16573418e-16, -7.77156117e-17, -1.77635684e-16,
9.43689571e-17, -2.22044605e-17, -6.09234885e-16,
-2.22044605e-16, -4.44089210e-17, -7.10542736e-16,
4.21884749e-16])
normed.std(axis=0)
Out[15]: array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
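Rather than reading the tiny residual values by eye, the result can be checked with np.allclose, which compares within a floating-point tolerance. This is a small sketch of that check on random data; the array name data and the use of np.random.default_rng are choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
data = 5 + 2 * rng.standard_normal((10, 10))

# Column-wise standardisation: subtract each column's mean, divide by its std
normed = (data - data.mean(axis=0)) / data.std(axis=0)

# Compare against the targets within floating-point tolerance
means_ok = np.allclose(normed.mean(axis=0), 0.0)
stds_ok = np.allclose(normed.std(axis=0), 1.0)
print(means_ok, stds_ok)
```

If the column sums of your own array are far from zero after subtracting the mean, it is worth checking the array's dtype and shape, since integer data or a transposed array would change the result.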

Eigenvectors computed with numpy's eigh and svd do not match

Consider singular value decomposition M=USV*. Then the eigenvalue decomposition of M* M gives M* M= V (S* S) V*=VS* U* USV*. I wish to verify this equality with numpy by showing that the eigenvectors returned by eigh function are the same as those returned by svd function:
import numpy as np
np.random.seed(42)
# create mean centered data
A=np.random.randn(50,20)
M= A-np.array(A.mean(0),ndmin=2)
# svd
U1,S1,V1=np.linalg.svd(M)
S1=np.square(S1)
V1=V1.T
# eig
S2,V2=np.linalg.eigh(np.dot(M.T,M))
indx=np.argsort(S2)[::-1]
S2=S2[indx]
V2=V2[:,indx]
# both Vs are in orthonormal form
assert np.all(np.isclose(np.linalg.norm(V1,axis=1), np.ones(V1.shape[0])))
assert np.all(np.isclose(np.linalg.norm(V1,axis=0), np.ones(V1.shape[1])))
assert np.all(np.isclose(np.linalg.norm(V2,axis=1), np.ones(V2.shape[0])))
assert np.all(np.isclose(np.linalg.norm(V2,axis=0), np.ones(V2.shape[1])))
assert np.all(np.isclose(S1,S2))
assert np.all(np.isclose(V1,V2))
The last assertion fails. Why?
Just play with small numbers to debug your problem.
Start with A=np.random.randn(3,2) instead of your much larger matrix with size (50,20)
In my random case, I find that
v1 = array([[-0.33872745, 0.94088454],
[-0.94088454, -0.33872745]])
and for v2:
v2 = array([[ 0.33872745, -0.94088454],
[ 0.94088454, 0.33872745]])
They differ only by a sign, and obviously, even when normalized to unit norm, an eigenvector is only determined up to its sign.
Now if you try the trick
assert np.all(np.isclose(V1,-1*V2))
for your original big matrix, it fails... again, this is OK. What happens is that some vectors have been multiplied by -1, some others haven't.
A correct way to check for equality between the vectors is:
assert np.allclose(abs((V1*V2).sum(0)),1.)
and indeed, to get a feeling of how this works you can print this quantity:
(V1*V2).sum(0)
that indeed is either +1 or -1 depending on the vector:
array([ 1., -1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., -1., 1., 1., 1., -1., -1.])
EDIT: This will happen in most cases, especially if starting from a random matrix. Notice however that this test will likely fail if one or more eigenvalues has an eigenspace of dimension larger than 1, as pointed out by @Sven Marnach in his comment below:
There might be other differences than just vectors multiplied by -1.
If any of the eigenvalues has a multi-dimensional eigenspace, you
might get an arbitrary orthonormal basis of that eigenspace, and two
such bases might be rotated against each other by an arbitrary
unitary matrix.
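As a sketch of how the sign ambiguity could be removed before comparing, the columns returned by eigh can be flipped to point the same way as the matching svd columns. This assumes all eigenvalues are distinct (which is the typical case for a random matrix, per the caveat above); the names signs and V2_aligned are made up for the example.

```python
import numpy as np

np.random.seed(42)
A = np.random.randn(50, 20)
M = A - A.mean(axis=0)  # mean-centred data

# Right singular vectors as columns
_, s, Vt = np.linalg.svd(M)
V1 = Vt.T

# Eigen-decomposition of M^T M, sorted by decreasing eigenvalue
w, V2 = np.linalg.eigh(M.T @ M)
order = np.argsort(w)[::-1]
w, V2 = w[order], V2[:, order]

# Column-wise dot products are close to +1 or -1; use them to flip columns of V2
signs = np.sign((V1 * V2).sum(axis=0))
V2_aligned = V2 * signs

eigenvalues_match = np.allclose(s ** 2, w)
vectors_match = np.allclose(V1, V2_aligned, atol=1e-6)
print(eigenvalues_match, vectors_match)
```

With repeated eigenvalues this alignment is not enough, since the two bases may differ by a rotation within the eigenspace rather than by signs alone.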

Array of hermite values in numpy

I have a data structure that looks like a list of values, and I am trying to compute the (x,y) 2D Hermite functions from them using numpy. I'm trying to use numpy arrays as much as possible because of the performance boost you get from getting to Fortran as quickly as possible (in practice I expect x to be many thousands of 3-arrays). Specifically, my code looks like this:
import numpy as np
from numpy.polynomial.hermite import hermval2d
x = np.array([[1., 2., 3.], [4., 5., 6.]])
coefs = np.array([[[1., 0.],[0., 1.]], [[0., 1.], [1., 0.]]])
z = np.array([0., 0.])
z[:] = hermval2d(x[:,0], x[:,1], coefs[:])
This raises an error about the shape of the hermval2d result. Running hermval2d on its own, without the assignment, gives:
In [XX]: hermval2d(x[:,0], x[:,1], coefs[:])
Out[XX]:
array([[ 9., 81.],
[ 6., 18.]])
I would expect the hermval2d to be a scalar for every x, y, and coefficient matrix, which is what you would expect from the documentation. So what am I missing here? What's the score?
It's right there in the docs :)
hermval2d(x, y, c)
[...]
The shape of the result will be c.shape[2:] + x.shape
In your case this seems to return the Hermite values for x and y evaluated for each ith 2d array in c[:,:,i].
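If the goal is one scalar per (x, y, coefficient matrix) triple, one straightforward option is to evaluate each pair separately. The sketch below assumes that coefs[i] is meant to be the 2D coefficient matrix belonging to point x[i], which is an interpretation of the question rather than something it states outright.

```python
import numpy as np
from numpy.polynomial.hermite import hermval2d

x = np.array([[1., 2., 3.], [4., 5., 6.]])
coefs = np.array([[[1., 0.], [0., 1.]], [[0., 1.], [1., 0.]]])

# Evaluate the i-th coefficient matrix at the i-th (x, y) point only
z = np.array([hermval2d(p[0], p[1], c) for p, c in zip(x, coefs)])
print(z)
```

This loops in Python, so for many thousands of points it may be slow. A fully vectorised alternative would need to avoid evaluating every coefficient set at every point, which is what a single hermval2d call with a 3D coefficient array does.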

scipy.linalg.norm different from sklearn.preprocessing.normalize?

from numpy.random import rand
from sklearn.preprocessing import normalize
from scipy.sparse import csr_matrix
from scipy.linalg import norm
w = (rand(1,10)<0.25)*rand(1,10)
x = (rand(1,10)<0.25)*rand(1,10)
w_csr = csr_matrix(w)
x_csr = csr_matrix(x)
(normalize(w_csr,axis=1,copy=False,norm='l2')*normalize(x_csr,axis=1,copy=False,norm='l2')).todense()
norm(w,ord='fro')*norm(x,ord='fro')
I am working with scipy csr_matrix and would like to normalize two matrices using the frobenius norm and get their product. But norm from scipy.linalg and normalize from sklearn.preprocessing seem to evaluate the matrices differently. Since technically in the above two cases I am calculating the same frobenius norm shouldn't the two expressions evaluate to the same thing? But I get the following answer:
matrix([[ 0.962341]])
0.4431811178371029
for sklearn.preprocessing and scipy.linalg.norm respectively. I am really interested to know what I am doing wrong.
sklearn.preprocessing.normalize divides each row by its norm. It returns a matrix with the same shape as its input. scipy.linalg.norm returns the norm of the matrix, which is a single scalar. So your calculations are not equivalent.
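The difference can be illustrated with plain NumPy on a small dense array. This is only a sketch of the two operations as described above (row-wise L2 scaling versus a single Frobenius norm); it does not call sklearn or scipy, and the array w is made up for the example.

```python
import numpy as np

w = np.array([[0., 3., 0., 4.]])

# Row-wise L2 normalisation: each row is divided by its own norm,
# so the result has the same shape as the input
row_norms = np.linalg.norm(w, axis=1, keepdims=True)
w_unit = w / row_norms

# Frobenius norm: one scalar for the whole matrix
fro = np.linalg.norm(w, 'fro')

print(w_unit)  # still a (1, 4) array
print(fro)     # a single number
```

Multiplying two Frobenius norms therefore gives a scalar built from the magnitudes of the inputs, whereas multiplying two row-normalised matrices combines unit-length rows, so there is no reason for the two numbers to agree.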
Note that your code is not correct as it is written. This line
(normalize(w_csr,axis=1,copy=False,norm='l2')*normalize(x_csr,axis=1,copy=False,norm='l2')).todense()
raises ValueError: dimension mismatch. The two calls to normalize both return matrices with shapes (1, 10), so their dimensions are not compatible for a matrix product. What did you do to get matrix([[ 0.962341]])?
Here's a simple function to compute the Frobenius norm of a sparse (e.g. CSR or CSC) matrix:
import numpy as np

def spnorm(a):
    # Frobenius norm from the stored (non-zero) entries only
    return np.sqrt(((a.data**2).sum()))
For example,
In [182]: b_csr
Out[182]:
<3x5 sparse matrix of type '<type 'numpy.float64'>'
with 5 stored elements in Compressed Sparse Row format>
In [183]: b_csr.A
Out[183]:
array([[ 1., 0., 0., 0., 0.],
[ 0., 2., 0., 4., 0.],
[ 0., 0., 0., 2., 1.]])
In [184]: spnorm(b_csr)
Out[184]: 5.0990195135927845
In [185]: norm(b_csr.A)
Out[185]: 5.0990195135927845
