How to normalize in numpy?

I have the following task: produce a numpy array Y of shape (N, M) where Y[i] contains the same data as X[i], but normalized to have mean 0 and standard deviation 1.
I have mapped the array like this:
(X - np.mean(X)) / np.std(X)
but it doesn't give me the correct answer.

You want to normalize along a specific axis, for instance:
(X - np.mean(X, axis=0)) / np.std(X, axis=0)
Otherwise you're calculating the statistics over the whole matrix, i.e. subtracting the global mean of all points/features and dividing by the global standard deviation.
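Note that for the shape described in the question, where each row Y[i] should end up with mean 0 and standard deviation 1, the statistics run along the second axis; a minimal sketch with hypothetical data, using keepdims so the broadcast lines up:
import numpy as np
X = np.random.rand(4, 6)  # hypothetical (N, M) data
Y = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
print(Y.mean(axis=1))  # each row mean is ~0
print(Y.std(axis=1))   # each row std is ~1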

Use norm from linalg
https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html
from numpy import linalg as LA
a = np.arange(9) - 4
LA.norm(a)  # 7.745966692414834
Then you divide the array by the norm:
a/LA.norm(a)
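Keep in mind this rescales a to unit Euclidean length (L2 normalization) rather than giving it zero mean and unit variance; a quick check:
import numpy as np
from numpy import linalg as LA
a = np.arange(9) - 4
print(LA.norm(a / LA.norm(a)))  # 1.0 - the result is a unit vector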

Related

Numpy cross covariance

Let X be a (d_x,n) matrix containing n observations of a d_x-dimensional variable x, and let w be a vector of weights (probabilities) of dimension n. The weighted covariance is given in numpy by
CX = numpy.cov(X, ddof=0, aweights=w)
Let now Y be a (d_y,n) matrix containing n observations of a d_y-dimensional variable y. Is there a clever way to compute the weighted cross covariance, in pseudocode
CXY = sum(w[i] * numpy.outer(X[:, i] - X_mean, Y[:, i] - Y_mean))
?
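One vectorized way to get that sum, sketched here with hypothetical data: either compute it directly with a single weighted matrix product, or stack X and Y and read the cross block out of numpy.cov's full covariance. Both are shown below, assuming w is normalized to sum to 1:
import numpy as np
rng = np.random.default_rng(0)
d_x, d_y, n = 3, 2, 100
X = rng.standard_normal((d_x, n))
Y = rng.standard_normal((d_y, n))
w = rng.random(n)
w /= w.sum()  # normalize the weights to probabilities
X_mean = X @ w  # weighted means, shape (d_x,)
Y_mean = Y @ w  # shape (d_y,)
CXY = ((X - X_mean[:, None]) * w) @ (Y - Y_mean[:, None]).T  # shape (d_x, d_y)
# Cross-check: the off-diagonal block of the stacked covariance
C = np.cov(np.vstack([X, Y]), ddof=0, aweights=w)
assert np.allclose(CXY, C[:d_x, d_x:])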

Compute distances in kmeans Lloyds algorithm

I'm trying to compute the distance between each row of matrix X (shape (N, D)) and each row of matrix mu (shape (K, D)) using numpy:
np.array([[np.linalg.norm(x - m) for m in mu] for x in X])
This is very slow. Is there a faster way to get the same result?
We can add a new axis to one matrix so that the two broadcast against each other, then take the norm along the last axis:
np.linalg.norm(X - mu[:,None], axis=-1, ord=2).T
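A quick equivalence check on hypothetical data (scipy.spatial.distance.cdist is another common way to get the same matrix):
import numpy as np
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))  # (N, D)
mu = rng.standard_normal((5, 4))   # (K, D)
slow = np.array([[np.linalg.norm(x - m) for m in mu] for x in X])
fast = np.linalg.norm(X - mu[:, None], axis=-1, ord=2).T  # (N, K)
assert np.allclose(slow, fast)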

Standardization of an numpy array

I am trying to standardize a numpy array of shape (M, N) so that its column means are 0. I think I have used the formula of standardization correctly, where x is the random variable and z is the standardized version of x.
z = (x - mean(x)) / std(x)
But the column means of the resulting array are not 0. They are very small numbers, but not zero. Any insight regarding my misunderstanding or mistake is welcome. Here is my code:
import numpy as np
X = np.load('data/filename.npy').astype('float')
XNormed = (X - np.mean(X, axis=0))/np.std(X, axis=0)
column_mean = np.mean(XNormed, axis=0)
print(column_mean)
Your code is correct. The column means of XNormed are not exactly zero because of floating-point round-off: numbers on the order of 1e-16 are as close to zero as the arithmetic gets, so the standardization has worked. Note that X.mean() and X.std() without an axis argument would aggregate over the whole array; keep axis=0 as you have it to get per-column statistics.
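A quick way to convince yourself the residuals are just round-off, on hypothetical stand-in data:
import numpy as np
X = np.random.rand(100, 5)  # stand-in for the loaded data
XNormed = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
print(np.allclose(np.mean(XNormed, axis=0), 0))  # True - ~1e-17 counts as zero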

Can covariance of A be used to calculate A'*A?

I am doing a benchmarking test in python on different ways to calculate A'*A, where A is an N x M matrix. One of the fastest ways was to use numpy.dot().
I was curious whether I could obtain the same result using numpy.cov() (which gives the covariance matrix) by somehow varying the weights or pre-processing the A matrix, but I had no success. Does anyone know if there is any relation between the product A'*A and the covariance of A, where A is a matrix with N rows/observations and M columns/variables?
Have a look at the cov source. Near the end of the function it does this:
c = dot(X, X_T.conj())
That is basically the dot product you want to perform. However, around it there are all kinds of other operations: checking inputs, subtracting the mean, normalization, ...
In short, np.cov will never ever be faster than np.dot(A.T, A) because internally it contains exactly that operation.
That said, the covariance matrix is computed as
C = (A - mean(A))' * (A - mean(A)) / (N - 1)
Or in Python:
import numpy as np
a = np.random.rand(10, 3)
m = np.mean(a, axis=0, keepdims=True)
x = np.dot((a - m).T, a - m) / (a.shape[0] - 1)
y = np.cov(a.T)
assert np.allclose(x, y) # check they are equivalent
As you can see, the covariance matrix is equivalent to the raw dot product if you subtract the mean of each variable and divide the result by the number of samples (minus one).
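The relation also runs in the other direction: the centered product differs from A'*A only by a rank-one mean term, so A'*A can be reassembled from np.cov (a sketch, not a speed win):
import numpy as np
A = np.random.rand(10, 3)
N = A.shape[0]
m = A.mean(axis=0)
# A'*A = (N - 1) * cov(A) + N * outer(mean, mean)
assert np.allclose((N - 1) * np.cov(A.T) + N * np.outer(m, m), A.T @ A)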

How to efficiently compute moving average in python

I am looking for a way of calculating the mean of each given value in a 3d Numpy array with the 20 values in rows directly above and 20 values in rows directly below. This is similar to a previous question I asked (Taking minimum value of each entry +- 10 rows either side in numpy array) but calculating the mean of 41 values instead of the minimum of 21 values.
I have tried using Scipy's uniform 1d filter, but this does not have a mode which deals with the values close to the edge of the array correctly. The part of the window that falls outside the array should not be included in the mean calculation (i.e. at the bottom/top locations in the array the mean should be taken from the edge value and the 20 rows above/below respectively).
Is there any way of using the uniform filter, or is there an alternative method which achieves this?
Thanks.
EDIT:
The Numpy array has dimensions 20x3200x18, so I was looking for a relatively efficient solution.
If you are really looking for performance here, you can exploit cumsum so that the sums only have to be calculated once; this should make the implementation about 40 times faster.
See below for an example. Without your exact data and a reference implementation I cannot verify that this does exactly what you want, but it should be correct in spirit.
import numpy as np
import matplotlib.pyplot as plt
arr = np.random.rand(20, 3200, 18)
n = 20
# One cumulative sum along the row axis; every windowed sum is then
# just a difference of two cumsum entries.
cumsum = np.cumsum(arr, axis=1)
# Top edge: growing windows covering rows 0..i.
means_lower = cumsum[:, :n, :] / np.arange(1, n + 1)[None, :, None]
# Interior: fixed windows of 2*n rows (an approximation of the
# 41-row centered window).
means_middle = (cumsum[:, 2 * n:, :] - cumsum[:, :-2 * n, :]) / (2 * n)
# Bottom edge: shrinking trailing windows of n down to 1 rows.
means_upper = (cumsum[:, -1, :][:, None, :] - cumsum[:, -n - 1:-1, :]) / np.arange(n, 0, -1)[None, :, None]
means = np.concatenate([means_lower, means_middle, means_upper], axis=1)
x = np.arange(3200)
plt.plot(x, means[0, :, 0])
You can use scipy.signal.convolve to do this.
import numpy as np
import scipy.signal as sig

def windowed_mean(arr, n):
    dims = len(arr.shape)
    s = sig.convolve(arr, np.ones((2*n+1,)*dims), mode='same')
    d = sig.convolve(np.ones_like(arr), np.ones((2*n+1,)*dims), mode='same')
    return s/d
Basically, s is a windowed sum and d is a windowed count, so dividing them averages only over the in-bounds part of each window, which avoids errors at the edges.
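Note that the kernel above spans all three dimensions, so it averages over a cubic neighborhood. For the rows-only window described in the question, the kernel can be restricted to axis 1 (a sketch with hypothetical data):
import numpy as np
import scipy.signal as sig
arr = np.random.rand(20, 3200, 18)
n = 20
kernel = np.ones((1, 2 * n + 1, 1))  # window runs over rows (axis 1) only
s = sig.convolve(arr, kernel, mode='same')
d = sig.convolve(np.ones_like(arr), kernel, mode='same')
means = s / d  # edge windows average only the in-bounds rows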
