Standardization of a numpy array - python

I am trying to standardize a numpy array of shape (M, N) so that its column means are 0. I think I have applied the standardization formula correctly, where x is the random variable and z is the standardized version of x.
z = (x - mean(x)) / std(x)
But the column means of the resulting array are not 0. They are very small numbers, but not zero. Any insight into my misunderstanding or mistake is welcome. Here is my code:
import numpy as np
X = np.load('data/filename.npy').astype('float')
XNormed = (X - np.mean(X, axis=0))/np.std(X, axis=0)
column_mean = np.mean(XNormed, axis=0)
print(column_mean)

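The code in your question already divides by the standard deviation (np.std), not by the range of the data, so the formula is applied correctly. The column means come out as very small numbers instead of exactly 0 because floating-point arithmetic accumulates rounding error; for float64 data such residuals are effectively zero and can be checked with np.allclose. Keep axis=0 in both calls so that each column is standardized with its own mean and standard deviation:
XNormed = (X - X.mean(axis=0)) / X.std(axis=0)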
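One way to check whether the small residual column means are only floating-point rounding is to compare them against 0 with a tolerance. The sketch below uses randomly generated data as a hypothetical stand-in for the contents of the .npy file, which are not shown in the question.

```python
import numpy as np

# Hypothetical stand-in for the array loaded from the .npy file
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))

# Column-wise standardization, as in the question
X_normed = (X - X.mean(axis=0)) / X.std(axis=0)

column_mean = X_normed.mean(axis=0)
column_std = X_normed.std(axis=0)

# The means are tiny but usually not exactly 0.0, so compare with a tolerance
print(column_mean)
print(np.allclose(column_mean, 0.0))
print(np.allclose(column_std, 1.0))
```

If both checks print True, the standardization worked and the leftover values are rounding noise.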

Related

How to slice and calculate the pearson correlation coefficient between one big and small array with "overlapping" windows arrays

Suppose I have two very simple arrays with numpy:
import numpy as np
reference=np.array([0,1,2,3,0,0,0,7,8,9,10])
probe=np.zeros(3)
I would like to find which slice of the array reference has the highest Pearson correlation coefficient with the array probe. To do that, I would like to slice reference into overlapping sub-arrays in a for loop, shifting by one element of reference at a time, and compare each slice against probe. I did the slicing using the inelegant code below:
from statistics import correlation
for i in range(0, len(reference)):
    # get the slice of the data
    sliced_data = reference[i:i+len(probe)]
    # only calculate the correlation when probe and reference have the same number of elements
    if len(sliced_data) == len(probe):
        my_rho = correlation(sliced_data, probe)
I have one issue and one question about this code:
1- once I run the code, I get the error below:
my_rho = correlation(sliced_data, probe)
File "/usr/lib/python3.10/statistics.py", line 919, in correlation
raise StatisticsError('at least one of the inputs is constant')
statistics.StatisticsError: at least one of the inputs is constant
2- is there a more elegant way of doing such slicing in Python?
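The error occurs because probe is constant (all zeros): the Pearson correlation divides by the standard deviation of each input, so it is undefined for a constant input and statistics.correlation raises StatisticsError. For the slicing, you can use sliding_window_view to get the successive windows. For a vectorized computation of the correlation, use a custom function:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view as swv

def np_corr(X, y):
    # adapted from https://stackoverflow.com/a/71253141
    n = len(y)
    denom = np.sqrt((n * np.sum(X**2, axis=-1) - np.sum(X, axis=-1)**2)
                    * (n * np.sum(y**2) - np.sum(y)**2))
    num = n * np.sum(X * y[None, :], axis=-1) - np.sum(X, axis=-1) * np.sum(y)
    # out=zeros makes entries with a zero denominator (constant input) come out
    # as 0 instead of uninitialized memory
    return np.divide(num, denom, out=np.zeros_like(denom, dtype=float), where=denom != 0)

corr = np_corr(swv(reference, len(probe)), probe)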
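Output (these values correspond to a non-constant probe such as np.array([1, 2, 3]); with the all-zero probe from the question the correlation is undefined for every window):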
array([ 1. , 1. , -0.65465367, -0.8660254 , 0. ,
0.8660254 , 0.91766294, 1. , 1. ])
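As a cross-check, the same windows can be fed one by one to np.corrcoef. This is a minimal sketch that assumes a non-constant probe of np.array([1, 2, 3]) (an assumption, since the question's probe is all zeros) and assigns 0 to any window that is constant.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

reference = np.array([0, 1, 2, 3, 0, 0, 0, 7, 8, 9, 10])
probe = np.array([1, 2, 3])  # assumed non-constant probe, not the question's zeros

# One row per overlapping window of len(probe) elements
windows = sliding_window_view(reference, len(probe))

rho = np.zeros(len(windows))
for i, w in enumerate(windows):
    # Pearson correlation is undefined for constant input, so leave those at 0
    if w.std() > 0 and probe.std() > 0:
        rho[i] = np.corrcoef(w, probe)[0, 1]

# Start index of a window with the highest correlation (ties are possible)
best = int(np.argmax(rho))
print(rho)
print(best, windows[best])
```

The loop is slower than the vectorized function but makes the handling of constant windows explicit.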

How to normalize in numpy?

I have the following task: produce a numpy array Y of shape (N, M) where Y[i] contains the same data as X[i], but normalized to have mean 0 and standard deviation 1.
I have mapped the array like this:
(X - np.mean(X)) / np.std(X)
but it doesn't give me the correct answer.
You want to normalize along a specific axis, for instance:
(X - np.mean(X, axis=0)) / np.std(X, axis=0)
Otherwise you're calculating the statistics over the whole matrix, i.e. subtracting the global mean of all points/features and the same with the standard deviation.
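Since the question asks for each row Y[i] to be normalized from X[i], the statistics would be taken along axis=1. The sketch below uses a small made-up array and keepdims=True so that the per-row mean and standard deviation broadcast against X; it also shows that global normalization does not give zero row means.

```python
import numpy as np

# Made-up example data, one sample per row
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 50.0],
              [-1.0, 0.0, 0.0, 5.0]])

# Per-row statistics; keepdims=True gives shape (N, 1) so they broadcast over columns
row_mean = X.mean(axis=1, keepdims=True)
row_std = X.std(axis=1, keepdims=True)
Y = (X - row_mean) / row_std

# Global normalization, as in the question, for comparison
Y_global = (X - np.mean(X)) / np.std(X)

print(Y.mean(axis=1), Y.std(axis=1))
print(Y_global.mean(axis=1))
```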
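Alternatively, use norm from linalg. Note that this scales the array to unit Euclidean length; it does not give mean 0 and standard deviation 1.
https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html
import numpy as np
from numpy import linalg as LA
a = np.arange(9) - 4
LA.norm(a)
# 7.745966692414834
Then divide the array by the norm:
a / LA.norm(a)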

Why is numpy's covariance slightly different to manually computing?

I'm just curious, and thought I'd ask this question. How come when I manually compute the covariance matrix of a set of data are my values slightly different to numpy's values?
I have two sets of data X and Y
import numpy as np
from scipy import io
data = io.loadmat("datafile.mat")['data']
X = data[:,0]
Y = data[:,1]
The covariance matrix can be computed like this (by looking at the covariance between X and X, X and Y, Y and X, etc.)
n = len(X)
corXX = np.var(X)
corXY = (1/n)*np.dot(X - np.mean(X), Y - np.mean(Y))
corYY = np.var(Y)
covariance = np.array([[corXX, corXY], [corXY, corYY] ])
For my dataset, that gives me:
array([[ 1.722105 , 5.34104265],
[ 5.34104265, 17.72717759]])
Whereas using numpy's covariance function covariance = np.cov(X,Y) gives me
array([[ 1.7395 , 5.39499258],
[ 5.39499258, 17.90623999]])
Similar, but not quite the same...
By default, np.cov calculates the unbiased covariance, which divides by (N - 1) instead of by N as in your calculation.
The documentation for np.cov describes the bias argument, which selects between the biased and unbiased versions of the covariance. It defaults to False; passing bias=True (or ddof=0) reproduces your manual result.
The (N - 1) prefactor is known as Bessel's correction, if you want to read more about the reasoning behind it.
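The relationship between the two results can be checked on synthetic data. The sketch below uses randomly generated X and Y as a stand-in for the .mat file, which is not available here, and compares the manual N-based matrix with np.cov under both settings.

```python
import numpy as np

# Synthetic stand-in for the two data columns
rng = np.random.default_rng(42)
X = rng.normal(size=100)
Y = 3.0 * X + rng.normal(size=100)
n = len(X)

# Manual (biased) covariance matrix, dividing by n as in the question
cov_xy = np.dot(X - X.mean(), Y - Y.mean()) / n
manual_biased = np.array([[np.var(X), cov_xy],
                          [cov_xy, np.var(Y)]])

default_cov = np.cov(X, Y)            # divides by n - 1
biased_cov = np.cov(X, Y, bias=True)  # divides by n

print(np.allclose(manual_biased, biased_cov))
print(np.allclose(manual_biased * n / (n - 1), default_cov))
```

Equivalently, np.var(X, ddof=1) gives the unbiased variance that appears on the diagonal of the default np.cov result.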

Numpy: zero mean data and standardization

I saw in a tutorial (there was no further explanation) that we can shift data to zero mean with x -= np.mean(x, axis=0) and normalize data with x /= np.std(x, axis=0). Can anyone elaborate on these two pieces of code? The only thing I got from the documentation is that np.mean calculates the arithmetic mean along a specific axis and np.std does the same for the standard deviation.
This is also called the z-score.
SciPy has a utility for it:
>>> from scipy import stats
>>> stats.zscore([ 0.7972, 0.0767, 0.4383, 0.7866, 0.8091,
... 0.1954, 0.6307, 0.6599, 0.1065, 0.0508])
array([ 1.1273, -1.247 , -0.0552, 1.0923, 1.1664, -0.8559, 0.5786,
0.6748, -1.1488, -1.3324])
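To connect this with the two numpy lines from the question, the sketch below computes the z-score both ways on the same sample values. It relies on stats.zscore using ddof=0 by default, which matches np.std.

```python
import numpy as np
from scipy import stats

x = np.array([0.7972, 0.0767, 0.4383, 0.7866, 0.8091,
              0.1954, 0.6307, 0.6599, 0.1065, 0.0508])

# SciPy utility
z_scipy = stats.zscore(x)

# Manual version: subtract the mean, then divide by the standard deviation
z_manual = (x - np.mean(x)) / np.std(x)

print(np.allclose(z_scipy, z_manual))
```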
Follow the comments in the code below:
import numpy as np
# create x
x = np.asarray([1, 2, 3, 4], dtype=np.float64)
np.mean(x)       # calculates the mean of the array x
x - np.mean(x)   # subtracts the mean of x from each value in x (returns a new array)
x -= np.mean(x)  # the -= can be read as x = x - np.mean(x)
np.std(x)        # calculates the standard deviation of the array
x /= np.std(x)   # the /= can be read as x = x / np.std(x)
From the syntax you give, I conclude that your array is multidimensional. I will first discuss the case where x is a one-dimensional array:
np.mean(x) computes the mean; by broadcasting, x - np.mean(x) subtracts the mean of x from all the entries. x -= np.mean(x, axis=0) is equivalent to x = x - np.mean(x, axis=0). The same holds for x /= np.std(x, axis=0).
In the case of multidimensional arrays the same thing happens, but instead of computing the mean over the entire array, you compute the mean over the first "axis". Axis is the numpy word for dimension. So if your x is two-dimensional, then np.mean(x, axis=0) = [np.mean(x[:,0]), np.mean(x[:,1]), ...]. Broadcasting again ensures that this is applied to all elements.
Note that this only works directly with the first dimension; for any other axis the shapes will in general not match for broadcasting. If you want to normalize with respect to another axis n, you need to do something like:
x -= np.expand_dims(np.mean(x, axis = n), n)
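A short sketch of normalizing along a non-leading axis follows. It uses a made-up 3-D array and shows that keepdims=True is an equivalent alternative to np.expand_dims for making the shapes broadcast.

```python
import numpy as np

x = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
n = 2  # axis to normalize along (chosen for illustration)

# Variant from the text: re-insert the reduced axis with expand_dims
centered_a = x - np.expand_dims(np.mean(x, axis=n), n)

# Equivalent variant: keepdims=True keeps the reduced axis with length 1
centered_b = x - np.mean(x, axis=n, keepdims=True)

# Full standardization along axis n
z = centered_b / np.std(x, axis=n, keepdims=True)

print(np.array_equal(centered_a, centered_b))
print(z.mean(axis=n))
```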
Key here are the augmented assignment operators. They perform the operation on the original variable.
a += c is equivalent to a = a + c, except that for numpy arrays the result is written into the existing array.
So a (in your case x) has to be defined beforehand.
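The difference between the in-place form and plain reassignment can be seen with a second name bound to the same array. This is a small illustrative sketch with made-up values; it also shows that the in-place form fails on an integer array, because the float result cannot be stored there.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
alias = a                 # second name bound to the same array
a -= a.mean()             # in place: the existing buffer is modified
# alias now also holds [-1., 0., 1.]

b = np.array([1.0, 2.0, 3.0])
original_b = b
b = b - b.mean()          # builds a new array; original_b still holds [1., 2., 3.]

ints = np.array([1, 2, 3])
try:
    ints -= ints.mean()   # float result cannot be cast back into an integer array
    in_place_failed = False
except TypeError:
    in_place_failed = True

print(alias, original_b, in_place_failed)
```

This is why converting the data to a float dtype first, as with astype('float') or dtype=np.float64, matters before using -= and /=.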
Each method takes an array/iterable (x) as input and outputs a value (or array if a multidimensional array was input), which is thus applied in your assignment operations.
With axis=0, the mean or std operation is applied over the rows: for each column, you take the values from every row and compute the mean or std.
axis=1 would take the values of each column for a given row.
With these two operations, you first remove the mean, so that each column is centered around 0. Then, when you divide by std, you rescale the spread of the data around zero so that each column has a standard deviation of 1; most values then lie within a few units of 0.
So now, each of your columns is centered around zero and standardized.
There are other scaling techniques, such as subtracting the minimum (or maximum) value and dividing by the range of values.
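The range-based alternative mentioned last is often called min-max scaling. A minimal sketch on made-up data, assuming no column is constant (a constant column would give a zero range):

```python
import numpy as np

# Made-up data with two columns on very different scales
x = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [4.0, 1000.0]])

x_min = x.min(axis=0)
x_range = x.max(axis=0) - x_min   # assumed non-zero for every column

# Each column is mapped onto the interval [0, 1]
x_scaled = (x - x_min) / x_range

print(x_scaled)
```

Unlike standardization, this bounds the values but does not center them at 0.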

Interpolate each row in matrix of x values

I want to interpolate between values in each row of a matrix (x-values) given a fixed vector of y-values. I am using python and essentially I need something like scipy.interpolate.interp1d but with x values being a matrix input. I implemented this by looping, but I want to make the operation as fast as possible.
Edit
Below is an example of what I am doing right now; note that my real matrix has many more rows, on the order of millions:
import numpy as np
x = np.linspace(0,1,100).reshape(10,10)
results = np.zeros(10)
for i in range(10):
    results[i] = np.interp(0.1, x[i], range(10))
As @Joe Kington suggested, you can use map_coordinates:
import scipy.ndimage as nd
# your data - make sure is float/double
X = np.arange(100).reshape(10,10).astype(float)
# the points where you want to interpolate each row
y = np.random.rand(10) * (X.shape[1]-1)
# the rows at which you want the data interpolated -- all rows
r = np.arange(X.shape[0])
result = nd.map_coordinates(X, [r, y], order=1, mode='nearest')
The above, for the following y:
array([ 8.00091648, 0.46124587, 7.03994936, 1.26307275, 1.51068952,
5.2981205 , 7.43509764, 7.15198457, 5.43442468, 0.79034372])
Note that each value is the (fractional) column position at which the corresponding row is interpolated.
Gives the following result:
array([ 8.00091648, 10.46124587, 27.03994936, 31.26307275,
41.51068952, 55.2981205 , 67.43509764, 77.15198457,
85.43442468, 90.79034372])
which makes sense given that the data come from np.arange and given the column positions (y) at which each row is interpolated.
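To see what map_coordinates computes here, the result can be compared with a plain per-row np.interp loop over the column indices. The sketch below uses a seeded random generator for y (an assumption, to make the run repeatable); with order=1 both approaches perform linear interpolation.

```python
import numpy as np
import scipy.ndimage as nd

X = np.arange(100).reshape(10, 10).astype(float)
rng = np.random.default_rng(0)
y = rng.random(10) * (X.shape[1] - 1)   # fractional column position per row
r = np.arange(X.shape[0])               # all rows

# Vectorized linear interpolation at (row, fractional column) pairs
result = nd.map_coordinates(X, [r, y], order=1, mode='nearest')

# Reference: interpolate each row separately over its column indices
cols = np.arange(X.shape[1])
expected = np.array([np.interp(y[i], cols, X[i]) for i in range(X.shape[0])])

print(np.allclose(result, expected))
```

Note that this interpolates the matrix values at given column positions, which is the setup used in the answer above rather than the exact np.interp call from the question.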
