NumPy: zero-mean data and standardization - python

I saw in a tutorial (with no further explanation) that we can shift data to zero mean with x -= np.mean(x, axis=0) and normalize it with x /= np.std(x, axis=0). Can anyone elaborate on these two pieces of code? The only thing I got from the documentation is that np.mean calculates the arithmetic mean along a specified axis and np.std does the same for the standard deviation.

This is also called the z-score.
SciPy has a utility for it:
>>> from scipy import stats
>>> stats.zscore([ 0.7972, 0.0767, 0.4383, 0.7866, 0.8091,
... 0.1954, 0.6307, 0.6599, 0.1065, 0.0508])
array([ 1.1273, -1.247 , -0.0552, 1.0923, 1.1664, -0.8559, 0.5786,
0.6748, -1.1488, -1.3324])
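As a rough cross-check of what zscore computes, the sketch below reproduces the same calculation with plain NumPy, assuming the population standard deviation (ddof=0), which is the zscore default. The variable names a and z are illustrative only.

```python
import numpy as np

a = np.array([0.7972, 0.0767, 0.4383, 0.7866, 0.8091,
              0.1954, 0.6307, 0.6599, 0.1065, 0.0508])

# z-score: subtract the mean, then divide by the standard deviation (ddof=0)
z = (a - a.mean()) / a.std()

print(np.round(z, 4))
print(z.mean(), z.std())  # approximately 0 and 1
```

After this transformation the values have mean 0 and standard deviation 1, up to floating-point error.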

Follow the comments in the code below:
import numpy as np
# create x
x = np.asarray([1,2,3,4], dtype=np.float64)
np.mean(x) # calculates the mean of the array x
x-np.mean(x) # this is equivalent to subtracting the mean of x from each value in x
x-=np.mean(x) # the -= can be read as x = x - np.mean(x)
np.std(x) # this calculates the standard deviation of the array
x/=np.std(x) # the /= can be read as x = x / np.std(x)

From the syntax you gave, I conclude that your array is multidimensional. I will first discuss the case where x is just a one-dimensional array:
np.mean(x) computes the mean; through broadcasting, x - np.mean(x) subtracts the mean of x from every entry. x -= np.mean(x, axis=0) is equivalent to x = x - np.mean(x, axis=0). The same holds for x / np.std(x).
In the case of multidimensional arrays the same thing happens, but instead of computing the mean over the entire array, you compute the mean along the first "axis". Axis is the numpy word for dimension. So if your x is two-dimensional, then np.mean(x, axis=0) = [np.mean(x[:,0]), np.mean(x[:,1]), ...]. Broadcasting again ensures that the subtraction is applied to all elements.
Note that this only works as written for the first dimension; for other axes the shapes will not match for broadcasting. If you want to normalize with respect to another axis n, you need to do something like:
x -= np.expand_dims(np.mean(x, axis = n), n)
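As a minimal sketch of normalizing along an axis other than the first, the example below uses the keepdims=True argument of np.mean and np.std, which keeps the reduced axis with length 1 so the result broadcasts against x. The array x and the names row_mean, row_std and z_rows are made up for illustration.

```python
import numpy as np

x = np.arange(12, dtype=np.float64).reshape(3, 4)

# Reduce along axis 1 (within each row). keepdims=True leaves shape (3, 1),
# which broadcasts cleanly against the (3, 4) array.
row_mean = np.mean(x, axis=1, keepdims=True)
row_std = np.std(x, axis=1, keepdims=True)
z_rows = (x - row_mean) / row_std

# The expand_dims form gives the same centered values.
same = x - np.expand_dims(np.mean(x, axis=1), 1)

print(z_rows.mean(axis=1), z_rows.std(axis=1))
```

Both forms produce the same centered array; keepdims simply avoids re-adding the axis by hand.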

The key here is the augmented assignment operators. They perform an operation on the original variable.
a += c is effectively a = a + c (for NumPy arrays the update happens in place).
So indeed a (in your case x) has to be defined beforehand.
Each method takes an array/iterable (x) as input and outputs a value (or an array if a multidimensional array was input), which is then applied in your assignment operations.
The axis=0 parameter means that the mean or std operation is applied down the rows: for each column, you take the values from every row and compute the mean or std.
axis=1 would instead take the values of every column for a given row.
With both operations, you first subtract the mean so that each column is centered around 0. Then, when you divide by the std, you rescale the spread of the data around zero so that each column has a standard deviation of 1; most values then typically lie within a few units of 0, though they are not confined to [-1, +1].
So now, each of your columns is centered around zero and standardized.
There are other scaling techniques, such as subtracting the minimum (or maximum) value and dividing by the range of values.
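To make the two kinds of scaling concrete, here is a small sketch on synthetic data comparing per-column standardization with min-max scaling. The data, the seed and the names z, mm, col_min and col_max are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, synthetic data
x = rng.normal(loc=5.0, scale=3.0, size=(1000, 3))

# Standardization: each column ends up with mean 0 and standard deviation 1.
z = (x - x.mean(axis=0)) / x.std(axis=0)

# Min-max scaling: each column is mapped onto the interval [0, 1].
col_min = x.min(axis=0)
col_max = x.max(axis=0)
mm = (x - col_min) / (col_max - col_min)

print(z.mean(axis=0), z.std(axis=0))
print(mm.min(axis=0), mm.max(axis=0))
```

Standardized values are not bounded, while min-max scaled values are; which one suits better depends on what the downstream computation expects.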

Related

Intuition behind the correlation

I'm following this tutorial online from kaggle and I can't get my head round why .T is changing the shape of the matrix. Here is the part I am stuck at:
#saleprice correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
I'm basically troubleshooting the code and tried this:
cm = np.corrcoef(df_train[cols].values)
cm.shape
returns a matrix with shape 1460x1460. But when I input:
cm = np.corrcoef(df_train[cols].values.T)
cm.shape
it returns a matrix with shape 10x10. Does anyone know why it does this? I can't figure it out.
The correlation gives you a normalized representation of the covariance matrix between all the "columns" of the dataframe. For instance, in the case of having only two variables, you'd end up with a matrix of the shape:
Rx = [[ 1, r_xy],
[r_yx, 1]]
This is quite an expensive computation, since it involves taking the dot product of each column with the rest, resulting in a correlation coefficient for each combination.
So in matrix notation, since you want to end up with a 10x10 matrix, you want to have the shapes correctly aligned. In this case you want (10,1460)x(1460,10) so you get a 10,10 matrix. Hence you need to transpose the 2D-array so that it has shape (10,1460) when you feed it to np.corrcoef.
Though you might find it a little easier by playing around with it yourself and seeing how the actual Pearson correlation is computed:
X = np.random.randint(0,10,(500,2))
print(np.corrcoef(X.T))
array([[1. , 0.04400245],
[0.04400245, 1. ]])
Which is doing the same as:
mean_X = X.mean(axis=0)
std_X = X.std(axis=0)
n, _ = X.shape
print((X.T-mean_X[:,None]).dot(X-mean_X)/(n*np.outer(std_X, std_X)))
array([[1. , 0.04400245],
[0.04400245, 1. ]])
Note that, as mentioned, this gives a normalized dot product of X with itself, so for each (1,1460)x(1460,1) product you're getting a single number. So X here, just as in your example, has to be transposed so the dimensions are correctly aligned.
From numpy documentation of corrcoef:
x : array_like
A 1-D or 2-D array containing multiple variables and observations.
Each row of x represents a variable, and
each column a single observation of all those variables. Also see rowvar below.
Note that each row represents a variable: in the first case you have 1460 rows and 10 columns, and in the second you have 10 rows with 1460 columns.
So when you transpose your NumPy array you're basically changing from 1460 variables with 10 values each to 10 variables with 1460 values each.
If you are dealing with pandas you could just use the built-in .corr() method that computes the correlation between columns.
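The row/column convention can be checked on a small synthetic array. The sketch below assumes random data of 500 observations by 3 variables and compares transposing with the rowvar=False argument of np.corrcoef; the names c_t, c_rv and c_manual are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, synthetic data
X = rng.normal(size=(500, 3))   # 500 observations (rows), 3 variables (columns)

c_t = np.corrcoef(X.T)               # variables passed as rows
c_rv = np.corrcoef(X, rowvar=False)  # tell NumPy the variables are the columns

# Manual Pearson correlation: standardize each column, then average the products.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
c_manual = Z.T @ Z / X.shape[0]

print(c_t.shape)             # one entry per pair of variables
print(np.corrcoef(X).shape)  # untransposed: every observation is treated as a variable
```

With rowvar=False the transpose is not needed, which can make the intent a little more explicit.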

Trace max along dim1 with varying index in dim3

I have a "cube" of 3D data where there is some peak in the column, or first dimension. The index of the peak may shift depending what row is examined. The third dimension may do something a bit more complicated, but for now can be thought of as just scaling things by some linear function.
I would like to find the index of the max along the first dimension, subject to the constraint that for each row, the z index is chosen such that the column peak will be closest to 0.5.
Here's a sample image that is a plane in row,column with a fixed z: (image not included)
These arrays will at times be large -- say, 21x11x200 float64s, so I would like to vectorize this calculation. Written with a for loop, it looks like this:
cols, rows, zs = data.shape
for i in range(rows):
    # for each field point, make an intermediate array that is 2D with focus,frequency dimensions
    arr = data[:,i,:]
    # compute the thru-focus max and find the peak closest to 0.5
    maxs = np.max(arr, axis=0)
    max_manip = np.abs(maxs-0.5)
    freq_idx = np.argmin(max_manip)
    # take the thru-focus slice that peaks closest to 0.5
    arr2 = data[:,i,freq_idx]
    focus_idx = np.argmax(arr2)
    print(focus_idx)
My issue is that I do not know how to roll these calculations up into a vector operation. I would appreciate any help, thanks!
We just need to use the axis param with the relevant ufuncs there and that would lead us to a vectorized solution, like so -
# Get freq indices along all rows in one go
idx = np.abs(data.max(0)-0.5).argmin(1)
# Index into data with those and get the argmax indices
out = data[:,np.arange(data.shape[1]), idx].argmax(0)
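One way to gain confidence in the vectorized version is to compare it against the loop on synthetic data. The sketch below assumes a random 21x11x200 cube as a stand-in for the real data; loop_out is a hypothetical name for the collected loop results.

```python
import numpy as np

rng = np.random.default_rng(0)      # arbitrary seed
data = rng.random((21, 11, 200))    # synthetic stand-in for the real cube

# Reference: the original loop, collecting the result for every row.
cols, rows, zs = data.shape
loop_out = np.empty(rows, dtype=np.intp)
for i in range(rows):
    freq_idx = np.argmin(np.abs(data[:, i, :].max(axis=0) - 0.5))
    loop_out[i] = np.argmax(data[:, i, freq_idx])

# Vectorized: max over axis 0 gives shape (rows, zs); argmin over axis 1
# picks one z index per row; fancy indexing then pulls one column per row.
idx = np.abs(data.max(0) - 0.5).argmin(1)
out = data[:, np.arange(data.shape[1]), idx].argmax(0)

print(np.array_equal(out, loop_out))
```

The indexing step pairs row i with idx[i], so the intermediate array has shape (cols, rows) rather than the full cube.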

Applying a mask for speeding up various array calculations

I have a np.ndarray with numbers that indicate spots of interest; I am interested in the spots which have values 1 and 9.
Right now they are being extracted as such:
maskindex.append(np.where(extract.variables['mask'][0] == 1) or np.where(megadatalist[0].variables['mask'][0] == 9))
xval = maskindex[0][1]
yval = maskindex[0][0]
I need to apply these x and y values to the arrays that I am operating on, to speed things up.
I have 140 arrays that are each 734 x 1468, I need the mean, max, min, std calculated for each field. And I was hoping there was an easy way for applying the masked array to speed up the operations, right now I am simply doing it on the entire arrays as such:
Average_List = np.mean([megadatalist[i].variables['analysed_sst'][0] for i in range(0,Numbers_of_datasets)], axis=0)
Average_Error_List = np.mean([megadatalist[i].variables['analysis_error'][0] for i in range(0,Numbers_of_datasets)], axis=0)
Std_List = np.std([megadatalist[i].variables['analysed_sst'][0] for i in range(0,Numbers_of_datasets)], axis=0)
Maximum_List = np.maximum.reduce([megadatalist[i].variables['analysed_sst'][0] for i in range(0,Numbers_of_datasets)])
Minimum_List = np.minimum.reduce([megadatalist[i].variables['analysed_sst'][0] for i in range(0,Numbers_of_datasets)])
Any ideas on how to speed things up would be highly appreciated
I may have solved it partially, depending on what you're aiming for. The following code reduces an array arr to a 1d array with only the relevant indices. You can then do the needed calculations without considering the unwanted locations.
arr = np.array([[0,9,9,0,0,9,9,1],[9,0,1,9,0,0,0,1]])
target = [1,9] # wanted values
index = np.where(np.isin(arr, target))  # np.isin keeps the shape of arr, so no ravel/reshape is needed
no_zeros = arr[index]
At this stage "all you need" is to reinsert the values "no_zeros" into an array of zeros with the appropriate shape, at the indices given in "index". One way is to flatten the index array and recalculate the indices so that they match a flattened arr array. Then use numpy.insert(np.zeros(arr.shape), new_index, no_zeros) and reshape to the appropriate shape afterwards. Reshaping is constant time in numpy. Admittedly, I have not figured out a fast numpy way to create the new_index array.
Hope it helps.
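A possible alternative, sketched here under assumptions, is to stack the fields into one 3D array and index it with a boolean mask so the statistics are only computed at the points of interest. Note that Python's or between two np.where results just returns the first one, so np.isin is used for the combined test. The small random arrays and the names mask_vals, fields, keep and selected are stand-ins, not the real netCDF variables.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n_sets, ny, nx = 5, 6, 8        # small stand-ins for 140 x 734 x 1468
mask_vals = rng.integers(0, 10, size=(ny, nx))  # hypothetical mask grid
fields = rng.random((n_sets, ny, nx))           # hypothetical stacked fields

# Boolean mask that is True where the mask grid equals 1 or 9.
keep = np.isin(mask_vals, [1, 9])
yval, xval = np.nonzero(keep)

# Select only the wanted grid points: shape (n_sets, number of kept points).
selected = fields[:, keep]

avg = selected.mean(axis=0)
std = selected.std(axis=0)
mx = selected.max(axis=0)
mn = selected.min(axis=0)

# Scatter a result back onto the full grid, NaN elsewhere.
avg_grid = np.full((ny, nx), np.nan)
avg_grid[keep] = avg
```

Whether this is actually faster depends on how many points the mask keeps and on how the data is read from disk, so it is worth timing on the real files.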

Numpy: evaluation of standard deviation of values above/below the average

I want to calculate the standard deviation for values below and above the average of a matrix of n_par parameters and n_sample samples. The fastest way I found so far is:
stdleft = numpy.zeros_like(mean)
for jpar in xrange(mean.shape[1]):
    stdleft[jpar] = p[p[:,jpar] < \
        mean[jpar],jpar].std()
where p is a matrix of shape (n_samples, n_par). Is there a smarter way to do it without the for loop? I have roughly n_par = 200 and n_samples = 1e8, and therefore these three lines take ages to run.
Any idea would be really helpful!
Thank you
As I understand it, you want to calculate the standard deviation of each column where the values are below the mean for that column.
In numpy, it's easiest to use masked arrays for this.
As an example:
import numpy as np
# 10 samples, 3 columns
p = np.random.random((10, 3))
# Calculate the mean of each column
colmeans = p.mean(axis=0)
# Make a boolean array where our condition is True
mask = p < colmeans
# Find the standard deviation of values in each column below the column's mean.
# For masked arrays, the True values will be masked, so we'll invert the array.
stdleft = np.ma.masked_where(~mask, p).std(axis=0)
You can also use pandas for this as @SudeepJuvekar mentioned. The performance should be broadly similar, but pandas should be a bit faster for this particular operation (untested).
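As a cross-check of the masked-array approach, the sketch below compares it with a NaN-based variant (np.nanstd over np.where) and with an explicit per-column loop on synthetic data. The names below, std_ma, std_nan and std_loop are illustrative; note that np.where makes a full copy of p, which may matter at 1e8 samples.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, synthetic data
p = rng.random((1000, 3))       # 1000 samples, 3 parameters

colmeans = p.mean(axis=0)
below = p < colmeans            # True where a value is below its column mean

# Masked array: hide everything that is NOT below the mean.
std_ma = np.ma.masked_where(~below, p).std(axis=0)

# NaN variant: replace unwanted values with NaN and ignore them.
std_nan = np.nanstd(np.where(below, p, np.nan), axis=0)

# Reference: explicit loop over the columns.
std_loop = np.array([p[below[:, j], j].std() for j in range(p.shape[1])])

print(np.allclose(std_nan, std_loop))
```

The values above the mean can be handled the same way by flipping the comparison.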
Pandas is your friend. Convert your matrix into a pandas DataFrame and index the DataFrame logically. Something like this:
mat = pandas.DataFrame(p)
This creates a DataFrame from the original numpy matrix p. Then we compute the column means for the DataFrame.
m = mat.mean()
This creates an n_par-sized Series of all column means of mat. Finally, index mat using the < logical operation and apply std to that.
stdleft = mat[mat < m].std()
Similarly for stdright. It takes a couple of minutes to compute on my machine.
Here's the doc page for pandas: http://pandas.pydata.org/
Edit: Edited using the comment below. You can do almost similar indexing using the original p.
m = p.mean(axis=0)
logical = p < m
logical contains a boolean matrix of the same size as p. This is where pandas comes in handy: you can directly index a DataFrame using a boolean frame of the same size. Doing so in numpy is slightly harder. I guess looping over the columns is the best way to achieve it?
stdleft = np.empty(p.shape[1])
for i in range(p.shape[1]):
    stdleft[i] = p[logical[:, i], i].std()

How do I compute the variance of a column of a sparse matrix in Scipy?

I have a large scipy.sparse.csc_matrix and would like to normalize it. That is, subtract the column mean from each element and divide by the column standard deviation (std).
scipy.sparse.csc_matrix has a .mean() but is there an efficient way to compute the variance or std?
You can calculate the variance yourself using the mean, with the following formula:
E[X^2] - (E[X])^2
E[X] stands for the mean. So to calculate E[X^2] you would have to square the csc_matrix element-wise and then use the mean function. To get (E[X])^2 you simply need to square the result of the mean function applied to the original matrix.
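The formula above can be sketched for column variances as follows, assuming a small synthetic matrix; multiply performs the element-wise square, and the names dense, mat, mean, mean_sq, var and std are illustrative. This gives the population variance (ddof=0).

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)  # arbitrary seed, synthetic data
dense = rng.random((50, 4))
dense[dense < 0.7] = 0.0        # make most entries zero
mat = sparse.csc_matrix(dense)

# E[X] per column; the sparse mean returns a 1 x n matrix, so flatten it.
mean = np.asarray(mat.mean(axis=0)).ravel()

# E[X^2] per column: element-wise square, then the mean.
mean_sq = np.asarray(mat.multiply(mat).mean(axis=0)).ravel()

# Var[X] = E[X^2] - (E[X])^2
var = mean_sq - mean**2
std = np.sqrt(var)

print(np.allclose(var, dense.var(axis=0)))
```

The dense array is only kept here to verify the result; on a large matrix the sparse operations alone would be used. For data with a large mean relative to its spread, this formula can lose precision to cancellation.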
Sicco has the better answer.
However, another way is to convert the sparse matrix to a dense numpy array one column at a time (to keep the memory requirements lower compared to converting the whole matrix at once):
# mat is the sparse matrix
# Get the number of columns
cols = mat.shape[1]
arr = np.empty(shape=cols)
for i in range(cols):
    arr[i] = np.var(mat[:, i].toarray())
The most efficient way I know of is to use StandardScaler from scikit-learn:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False)
scaler.fit(X)
Then the variances are in the attribute var_:
X_var = scaler.var_
The curious thing though, is that when I densified first using pandas (which is very slow) my answer was off by a few percent. I don't know which is more accurate.
The efficient way is actually to densify the entire matrix, then standardize it column by column in the usual way with
X = X.toarray()
X -= X.mean(axis=0)
X /= X.std(axis=0)
As @Sebastian has noted in his comments, standardizing destroys the sparsity structure (introduces lots of non-zero elements) in the subtraction step, so there's no use keeping the matrix in a sparse format.
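The loss of sparsity is easy to see on a small synthetic example. The sketch below uses a dense NumPy array that is mostly zeros and counts the non-zero entries before and after each step; the names centered and scaled are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, synthetic data
X = rng.random((100, 5))
X[X < 0.9] = 0.0                # roughly 90% of the entries become zero

nonzero_before = np.count_nonzero(X)

# Subtracting the column mean turns every former zero into -mean.
centered = X - X.mean(axis=0)
nonzero_after = np.count_nonzero(centered)

# Dividing by the column std alone leaves the zeros in place.
scaled = X / X.std(axis=0)

print(nonzero_before, nonzero_after, np.count_nonzero(scaled))
```

This is why scaling without centering can stay in a sparse format, while full standardization generally cannot.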
