Intuition behind the correlation - python

I'm following this tutorial online from Kaggle and I can't get my head around why .T is changing the shape of the matrix. Here is the part I am stuck at:
#saleprice correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
I'm basically troubleshooting the code and tried this:
cm = np.corrcoef(df_train[cols].values)
cm.shape
returns a matrix with shape 1460x1460. But when I input:
cm = np.corrcoef(df_train[cols].values.T)
cm.shape
it returns a matrix with shape 10x10. Does anyone know why it does this? I can't figure it out.

The correlation gives you a normalized representation of the covariance matrix between all the "columns" of the dataframe. For instance, in the case of having only two variables, you'd end up with a matrix of the shape:
Rx = [[ 1, r_xy],
[r_yx, 1]]
This is quite an expensive computation, since it involves taking the dot product of each column with the rest, resulting in a correlation coefficient for each combination.
So in matrix notation, since you want to end up with a 10x10 matrix, you want to have the shapes correctly aligned. In this case you want (10,1460)x(1460,10) so you get a 10,10 matrix. Hence you need to transpose the 2D-array so that it has shape (10,1460) when you feed it to np.corrcoef.
Though you might find it a little easier by playing around with it yourself and seeing how the actual Pearson correlation is computed:
X = np.random.randint(0,10,(500,2))
print(np.corrcoef(X.T))
array([[1. , 0.04400245],
[0.04400245, 1. ]])
Which is doing the same as:
mean_X = X.mean(axis=0)
std_X = X.std(axis=0)
n, _ = X.shape
# divide by the product of both standard deviations (std_i * std_j), not by std_X**2
print((X.T-mean_X[:,None]).dot(X-mean_X)/(n*np.outer(std_X, std_X)))
array([[1. , 0.04400245],
[0.04400245, 1. ]])
Note that, as mentioned, this gives a normalized dot product of X with itself, so for each (1,1460)x(1460,1) product you're getting a single number. So X here, just as in your example, has to be transposed so the dimensions are correctly aligned.

From numpy documentation of corrcoef:
x : array_like
A 1-D or 2-D array containing multiple variables and observations.
Each row of x represents a variable, and
each column a single observation of all those variables. Also see rowvar below.
Note that each row represents a variable: in the first case you have 1460 rows and 10 columns, and in the second one you have 10 rows with 1460 columns.
So when you transpose your NumPy array you're basically changing from 1460 variables with 10 values for each one to 10 variables with 1460 values for each one.
If you are dealing with pandas you could just use the built-in .corr() method that computes the correlation between columns.
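To see the shapes side by side, here is a minimal sketch. The DataFrame df is a hypothetical random stand-in for df_train[cols] (it is not the Kaggle data), and the column names are made up. It compares the untransposed call, the transposed call, the rowvar=False flag and the pandas .corr() method:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_train[cols]: 1460 observations of 10 variables.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1460, 10)),
                  columns=[f"v{i}" for i in range(10)])

values = df.values                                # shape (1460, 10)

by_rows = np.corrcoef(values)                     # each ROW is treated as a variable
by_cols_t = np.corrcoef(values.T)                 # transpose: columns become rows
by_cols_flag = np.corrcoef(values, rowvar=False)  # same thing without transposing
by_pandas = df.corr().values                      # pandas correlates columns

print(by_rows.shape)    # (1460, 1460)
print(by_cols_t.shape)  # (10, 10)
```

The last three results should agree up to floating-point error, so with NumPy the transpose can be replaced by rowvar=False if that reads more clearly.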

Related

Get Index of data point in training set with shortest distance to input matrix with Numpy

I would like to build a function npbatch(U,X) which compares data points in an input matrix (U) with data points in a training matrix (X) and gets me the index of X with the shortest euclidean distance to the data point in U.
I would like to avoid any loops to increase the performance and I would like to use the function scipy.spatial.distance.cdist to compute the distance.
Example Input:
U
array([[0.69646919, 0.28613933, 0.22685145],
[0.55131477, 0.71946897, 0.42310646],
[0.9807642 , 0.68482974, 0.4809319 ]])
X
array([[0.24875591, 0.16306678, 0.78364326],
[0.80852339, 0.62562843, 0.60411363],
[0.8857019 , 0.75911747, 0.18110506]])
--> Expected Output: Array with the three indices of the data points in X which have the shortest distance to the three data points in U.
My overall target is then to get the label of the corresponding data point using the index which I've got. Example for label input would be:
Y
array([1, 0, 0])
Thank you for any hint!
With scipy.spatial.distance.cdist you already chose a well-suited function for the task. To get the indices, we just have to apply numpy.argmin along the axis 0 (or axis 1 for cdist(U, X)):
ix = numpy.argmin(scipy.spatial.distance.cdist(X, U), 0)
Getting the label is then trivial:
Y[ix]
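As a rough end-to-end sketch using the example arrays from the question (the variable names dists and labels are only illustrative), this uses the cdist(U, X) orientation, so the argmin runs along axis 1:

```python
import numpy as np
from scipy.spatial.distance import cdist

U = np.array([[0.69646919, 0.28613933, 0.22685145],
              [0.55131477, 0.71946897, 0.42310646],
              [0.9807642 , 0.68482974, 0.4809319 ]])
X = np.array([[0.24875591, 0.16306678, 0.78364326],
              [0.80852339, 0.62562843, 0.60411363],
              [0.8857019 , 0.75911747, 0.18110506]])
Y = np.array([1, 0, 0])

# dists[i, j] is the Euclidean distance from U[i] to X[j]
dists = cdist(U, X)

# for every row of U, the index of the closest row of X
ix = np.argmin(dists, axis=1)

# labels of the nearest training points
labels = Y[ix]
print(ix, labels)
```

Both orientations give the same indices; the only thing to keep straight is that the argmin axis must be the one that runs over X.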

How important are the rows vs columns in PCA?

I have a dataset of pictures in which each column is a vector that can be reshaped into a 32x32 picture. The dataset has dimensions 1024 x 20000, meaning 20000 image samples.
When I look at various ways of doing PCA without built-in functions from something like scikit-learn, people sometimes take the mean of the rows and subtract it from the original matrix before computing the covariance matrix, i.e. the following:
# A is a numpy array with dimensions 1024 x 20000
mean_rows = A.mean(0)
new_A = A-mean_rows
Other times people take the mean of the columns and then subtract that from the original matrix:
# A is a numpy array with dimensions 1024 x 20000
mean_cols = A.mean(1)
new_A = A-mean_cols[:, None]
Now my question is: when are you supposed to do which? Given a dataset like the one in my example, which of the methods would I use?
I looked at a variety of websites such as https://machinelearningmastery.com/calculate-principal-component-analysis-scratch-python/ and
http://sebastianraschka.com/Articles/2014_pca_step_by_step.html
I think you're talking about normalizing the dataset to have zero mean. You should average over the axis that indexes the observations, so that every feature (here, every pixel) ends up with zero mean.
In your example, you have 20,000 observations with 1,024 dimensions each, and your matrix lays out each observation as a column, so you should average across the columns (axis 1), which gives one mean per row.
In code that would be:
A = A - A.mean(axis=1, keepdims=True)
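For PCA the quantity to remove is the per-feature mean, which with observations stored as columns means averaging along axis 1. A minimal sketch, assuming a small random array as a stand-in for the 1024 x 20000 data (the names feature_means, A_centered and cov are illustrative only):

```python
import numpy as np

# Small hypothetical stand-in: 16 features (rows) x 200 observations (columns).
rng = np.random.default_rng(0)
A = rng.random((16, 200))

# One mean per feature; keepdims=True keeps shape (16, 1) so it broadcasts
# against the (16, 200) array.
feature_means = A.mean(axis=1, keepdims=True)
A_centered = A - feature_means

# Covariance between features, shape (16, 16), from the centered data.
n_obs = A.shape[1]
cov = A_centered @ A_centered.T / (n_obs - 1)

print(A_centered.mean(axis=1).round(12))  # every feature mean is ~0
```

If the data were stored the other way round (one observation per row), the same idea would use axis=0 instead.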

Reshaping numpy array

What I am trying to do is take a numpy array representing 3D image data and calculate the hessian matrix for every voxel. My input is a matrix of shape (Z,X,Y) and I can easily take a slice along z and retrieve a single original image.
gx, gy, gz = np.gradient(imgs)
gxx, gxy, gxz = np.gradient(gx)
gyx, gyy, gyz = np.gradient(gy)
gzx, gzy, gzz = np.gradient(gz)
And I can access the hessian for an individual voxel as follows:
x = 100
y = 100
z = 63
H = [[gxx[z][x][y], gxy[z][x][y], gxz[z][x][y]],
[gyx[z][x][y], gyy[z][x][y], gyz[z][x][y]],
[gzx[z][x][y], gzy[z][x][y], gzz[z][x][y]]]
But this is cumbersome and I can't easily slice the data.
I have tried using reshape as follows
H = H.reshape(Z, X, Y, 3, 3)
But when I test this by retrieving the hessian for a specific voxel, the value returned from the reshaped array is completely different from the one in the original array.
I think I could use zip somehow but I have only been able to find that for making lists of tuples.
Bonus: If there's a faster way to accomplish this please let me know. I essentially need to calculate the three eigenvalues of the hessian matrix for every voxel in the 3D data set. Calculating the hessian values is really fast, but finding the eigenvalues for a single 2D image slice takes about 20 seconds. Are there any GPU- or TensorFlow-accelerated libraries for image processing?
We can use a list comprehension to get the hessians -
H_all = np.array([np.gradient(i) for i in np.gradient(imgs)]).transpose(2,3,4,0,1)
Just to give it a bit of explanation : [np.gradient(i) for i in np.gradient(imgs)] loops through the two levels of outputs from np.gradient calls, resulting in a (3 x 3) shaped tensor at the outer two axes. We need these two as the last two axes in the final output. So, we push those at the end with the transpose.
Thus, H_all holds all the hessians and hence we can extract our specific hessian given x,y,z, like so -
x = 100
y = 100
z = 63
H = H_all[z,x,y]  # same (Z,X,Y) index order as in the question
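A small sanity check of this layout, as a sketch on a made-up random volume with axes ordered (Z, X, Y) as in the question. It also computes all eigenvalues in one batched call with np.linalg.eigvalsh, which accepts stacked matrices; that call assumes each Hessian is symmetric, so treat the eigenvalue part as a suggestion to try rather than a measured speed-up:

```python
import numpy as np

# Hypothetical small volume with axes (Z, X, Y).
rng = np.random.default_rng(0)
imgs = rng.random((6, 7, 8))

# Second derivatives stacked as (3, 3, Z, X, Y), then moved to (Z, X, Y, 3, 3).
H_all = np.array([np.gradient(g) for g in np.gradient(imgs)]).transpose(2, 3, 4, 0, 1)

# Build the Hessian of one voxel element by element for comparison.
z, x, y = 3, 4, 5
H_manual = np.array([[d[z, x, y] for d in np.gradient(g)]
                     for g in np.gradient(imgs)])

# Eigenvalues of every voxel's Hessian at once (assumes symmetric matrices).
eigs = np.linalg.eigvalsh(H_all)

print(H_all.shape, eigs.shape)  # (6, 7, 8, 3, 3) (6, 7, 8, 3)
```

Indexing H_all[z, x, y] follows the array's own axis order, so it returns the same 3 x 3 matrix as the element-by-element construction.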

Numpy:zero mean data and standardization

I saw in a tutorial (there was no further explanation) that we can shift data to zero mean with x -= np.mean(x, axis=0) and normalize it with x /= np.std(x, axis=0). Can anyone elaborate on these two pieces of code? The only thing I got from the documentation is that np.mean calculates the arithmetic mean along a specific axis and np.std does the same for the standard deviation.
This is also called zscore.
SciPy has a utility for it:
>>> from scipy import stats
>>> stats.zscore([ 0.7972, 0.0767, 0.4383, 0.7866, 0.8091,
... 0.1954, 0.6307, 0.6599, 0.1065, 0.0508])
array([ 1.1273, -1.247 , -0.0552, 1.0923, 1.1664, -0.8559, 0.5786,
0.6748, -1.1488, -1.3324])
Follow the comments in the code below
import numpy as np
# create x
x = np.asarray([1,2,3,4], dtype=np.float64)
np.mean(x) # calculates the mean of the array x
x-np.mean(x) # this is equivalent to subtracting the mean of x from each value in x
x-=np.mean(x) # the -= can be read as x = x - np.mean(x)
np.std(x) # this calculates the standard deviation of the array
x/=np.std(x) # the /= can be read as x = x / np.std(x)
From the syntax you give, I conclude that your array is multidimensional. I will first discuss the case where your x is just a one-dimensional array:
np.mean(x) computes the mean; by broadcasting, x-np.mean(x) subtracts the mean of x from all the entries. x -=np.mean(x,axis = 0) is equivalent to x = x-np.mean(x,axis = 0). The same holds for x /= np.std(x).
In the case of multidimensional arrays the same thing happens, but instead of computing the mean over the entire array, you compute the mean along the first axis only. "Axis" is the NumPy word for dimension. So if your x is two-dimensional, then np.mean(x,axis =0) = [np.mean(x[:,0]), np.mean(x[:,1]), ...]. Broadcasting again ensures that this is done to all elements.
Note that this only works with the first dimension; otherwise the shapes will not match for broadcasting. If you want to normalize with respect to another axis you need to do something like:
x -= np.expand_dims(np.mean(x, axis = n), n)
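A short sketch of the same idea on a small made-up array, comparing np.expand_dims with the keepdims=True argument of np.mean, which keeps the reduced axis as length 1 so the result broadcasts directly:

```python
import numpy as np

x = np.arange(12, dtype=np.float64).reshape(3, 4)
n = 1  # normalize along the second axis (within each row)

# np.mean(x, axis=n) has shape (3,); re-insert the reduced axis to get (3, 1).
a = x - np.expand_dims(np.mean(x, axis=n), n)

# keepdims=True keeps the reduced axis, so no expand_dims is needed.
b = x - np.mean(x, axis=n, keepdims=True)

print(a.mean(axis=n))  # each row now has mean 0
```

Note that the in-place forms (x -= ...) need a floating-point array; with an integer dtype NumPy refuses to write the float result back into x.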
Key here are the assignment operators. They actually perform an operation on the original variable.
a += c is actually equal to a=a+c.
So indeed a (in your case x) has to be defined beforehand.
Each method takes an array/iterable (x) as input and outputs a value (or array if a multidimensional array was input), which is thus applied in your assignment operations.
The axis=0 parameter means that you apply the mean or std operation over the rows. Hence, you take the values of every row in a given column and compute their mean or std.
axis=1 would take the values of every column in a given row.
What you do with both operations is first remove the mean, so that each column is now centered on 0. Then, when you divide by the std, you rescale the spread of the data around zero so that each column has a standard deviation of 1; most values then typically lie within a few units of 0.
So now, each of your columns is centered on zero and standardized.
There are other scaling techniques, such as subtracting the minimum value and dividing by the range of values (min-max scaling).
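To make the column-wise behaviour concrete, here is a minimal sketch on made-up random data. It checks the manual standardization against scipy.stats.zscore and shows the min-max alternative; the names z and mm are illustrative:

```python
import numpy as np
from scipy import stats

# Hypothetical data: 100 observations (rows) of 4 features (columns).
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(100, 4))

# Standardize each column: subtract its mean, divide by its std.
z = (x - np.mean(x, axis=0)) / np.std(x, axis=0)

# Min-max scaling: map each column onto the interval [0, 1].
mm = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

print(z.mean(axis=0).round(12), z.std(axis=0))  # ~0 and 1 per column
```

After standardization the values are not confined to [-1, 1]; it is the standard deviation of each column that equals 1. Min-max scaling is the one that gives a bounded interval.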

Interpolate each row in matrix of x values

I want to interpolate between values in each row of a matrix (x-values) given a fixed vector of y-values. I am using python and essentially I need something like scipy.interpolate.interp1d but with x values being a matrix input. I implemented this by looping, but I want to make the operation as fast as possible.
Edit
Below is an example of what I am doing right now. Note that my real matrix has many more rows, on the order of millions:
import numpy as np
x = np.linspace(0,1,100).reshape(10,10)
results = np.zeros(10)
for i in range(10):
results[i] = np.interp(0.1,x[i],range(10))
As @Joe Kington suggested, you can use map_coordinates:
import scipy.ndimage as nd
# your data - make sure it is float/double
X = np.arange(100).reshape(10,10).astype(float)
# the points where you want to interpolate each row
y = np.random.rand(10) * (X.shape[1]-1)
# the rows at which you want the data interpolated -- all rows
r = np.arange(X.shape[0])
result = nd.map_coordinates(X, [r, y], order=1, mode='nearest')
The above, for the following y:
array([ 8.00091648, 0.46124587, 7.03994936, 1.26307275, 1.51068952,
5.2981205 , 7.43509764, 7.15198457, 5.43442468, 0.79034372])
Note, each value indicates the position in which the value is going to be interpolated for each row.
Gives the following result:
array([ 8.00091648, 10.46124587, 27.03994936, 31.26307275,
41.51068952, 55.2981205 , 67.43509764, 77.15198457,
85.43442468, 90.79034372])
which makes sense considering the nature of the aranged data, and the columns (y) at which it is interpolated.
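One way to convince yourself of what map_coordinates does here is to compare it with an explicit loop of np.interp over the rows, as in the sketch below (seeded random positions; the names vectorized and looped are illustrative). Note that this treats the column index as the coordinate and interpolates the row values at fractional column positions, which is not the same setup as the loop in the question, where the rows of x are the sample positions:

```python
import numpy as np
import scipy.ndimage as nd

rng = np.random.default_rng(0)
X = np.arange(100).reshape(10, 10).astype(float)

# One fractional column position per row.
y = rng.random(10) * (X.shape[1] - 1)
r = np.arange(X.shape[0])

# Vectorized: linear interpolation (order=1) at the points (r[i], y[i]).
vectorized = nd.map_coordinates(X, [r, y], order=1, mode='nearest')

# Reference loop: interpolate row i at position y[i] along the column index.
cols = np.arange(X.shape[1])
looped = np.array([np.interp(y[i], cols, X[i]) for i in range(X.shape[0])])

print(np.allclose(vectorized, looped))
```

Because the row coordinates are whole numbers, the interpolation only happens along the columns, which is why the two results should match.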
