Calculate squared deviation from the mean for each element in array - python

I have an array with shape (128, 116, 116, 1), where the 1st dimension is the number of subjects and the 2nd and 3rd dimensions hold the data.
I am trying to calculate the variance (squared deviation from the mean) at each position (i.e. at (0,0), (0,1), (1,0), etc., up to (116,116)) across all 128 subjects, resulting in an array with shape (116,116).
Can anyone tell me how to accomplish this?
Thank you!

Let's say we have a multidimensional list a of shape (3,2,2)
import numpy as np
a = [
    [[1, 1],
     [1, 1]],
    [[2, 2],
     [2, 2]],
    [[3, 3],
     [3, 3]],
]
np.var(a, axis=0)  # results in:
> array([[0.66666667, 0.66666667],
>        [0.66666667, 0.66666667]])
If you want to compute the variance across all 128 subjects (which would be axis 0) efficiently, I don't see a way to do it with the statistics package, since it doesn't accept nested lists as input; you would have to write your own logic and loop over the subjects.
With the numpy.var function, however, we can easily calculate the variance of each 'datapoint' (tuple of indices) across all 128 subjects.
Side note: You mentioned statistics.variance. However, that is only to be used when you are taking a sample from a population as is mentioned in the documentation you linked. If you were to go the manual route, you would use statistics.pvariance instead, since we are calculating it on the whole dataset.
The difference can be seen here:
statistics.pvariance([1,2,3])
> 0.6666666666666666 # (population variance -- what we want here)
statistics.variance([1,2,3])
> 1 # (sample variance -- not what we want here)
np.var([1,2,3])
> 0.6666666666666666 # (np.var computes the population variance by default)
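Applied to the array from the question, a minimal sketch (data is a hypothetical name standing in for the (128, 116, 116, 1) array of subjects) would be:
import numpy as np

# stand-in for the (128, 116, 116, 1) array described in the question
data = np.random.rand(128, 116, 116, 1)

variance = np.var(data, axis=0).squeeze(-1)  # variance across subjects, drop the trailing axis
print(variance.shape)  # (116, 116)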

Related

Is it a problem of stability of the matrix calculation in python?

I recently encountered an accuracy problem with the matrix product/multiplication in NumPy. See my example below (also here: https://trinket.io/python3/6a4c22e450):
import numpy as np
para = np.array([[ 3.28522453e+08, -1.36339334e+08, 1.36339334e+08],
[-1.36339334e+08, 5.65818682e+07, -5.65818682e+07],
[ 1.36339334e+08, -5.65818672e+07, 5.65818682e+07]])
in1 = np.array([[ 285.91695469],
[ 262.3 ],
[-426.64380594]])
in2 = np.array([[ 285.91695537],
[ 262.3 ],
[-426.64380443]])
(in1 - in2)/in1
>>> array([[-2.37831286e-09],
[ 0.00000000e+00],
[ 3.53925214e-09]])
The relative difference between in1 and in2 is very small, on the order of 10^-9.
res1 = para @ in1
>>> array([[-356.2361908 ],
[ 443.16068268],
[-180.86068344]])
res2 = para @ in2
>>> array([[ 73.03147125],
[265.01131439],
[ -2.71131516]])
But after the matrix multiplication, why does the difference between the outputs res1 and res2 change so much?
(res1 - res2)/res1
>>> array([[1.20500857],
[0.40199723],
[0.98500882]])
This is not a bug; it is to be expected with a matrix such as yours.
Your matrix (which is symmetric) has one large and two small eigenvalues:
In [34]: evals, evecs = np.linalg.eigh(para)
In [35]: evals
Out[35]: array([-1.06130078e-01, 1.00000000e+00, 4.41686189e+08])
Because the matrix is symmetric, it can be diagonalized with an orthonormal basis. That just means that we can define a new coordinate system in which the matrix is diagonal, and the diagonal values are those eigenvalues. The effect of multiplying the matrix by a vector in these coordinates is to simply multiply each coordinate by the corresponding eigenvalue, i.e. the first coordinate is multiplied by -0.106, the second coordinate doesn't change, and the third coordinate is multiplied by the large factor 4.4e8.
The reason you get such a drastic change when multiplying the original matrix para by in1 and in2 is that, in the new coordinates, the third component of the transformed in1 is positive, and the third component of the transformed in2 is negative. (That is, the points are on opposite sides of the 2-d eigenspace associated with the two smaller eigenvalues.) There are several ways to find these transformed coordinates; one is to compute inv(V) @ x, where V is the matrix of eigenvectors:
In [36]: np.linalg.solve(evecs, in1)
Out[36]:
array([[ 5.64863071e+02],
[-1.16208620e+02],
[ 8.55527517e-07]])
In [37]: np.linalg.solve(evecs, in2)
Out[37]:
array([[ 5.64863070e+02],
[-1.16208619e+02],
[-2.71381169e-07]])
Note the different signs of the third components. The values are small, but when you multiply by the diagonal matrix, they are multiplied by 4.4e8, giving 377.87 and -119.86, respectively. That large change is what shows up in the results you observed in the original coordinates.
For a rougher calculation: note that the elements of para are ~10^8, so multiplication on that order of magnitude occurs when you compute para # x. It is not surprising then, that given the relative differences between in1 and in2 are ~10^-9, the relative differences of res1 and res2 will be ~10^-9 * ~10^8 or ~0.1. (Your calculated relative errors were [1.2, 0.4, 0.99], so the rough estimate is in the right ballpark.)
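One quick way to quantify this sensitivity is the condition number of the matrix, which for a symmetric matrix is the ratio of the largest to the smallest eigenvalue magnitude; relative input errors can be amplified by roughly this factor. A small sketch using the matrix from the question:
import numpy as np

para = np.array([[ 3.28522453e+08, -1.36339334e+08,  1.36339334e+08],
                 [-1.36339334e+08,  5.65818682e+07, -5.65818682e+07],
                 [ 1.36339334e+08, -5.65818672e+07,  5.65818682e+07]])

# condition number is about 4e9, so a ~1e-9 relative input error
# can plausibly grow to an order-1 relative error in the output
print(np.linalg.cond(para))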
This looks like a bug ... NumPy is written in C, so this could be an issue of casting numbers into a smaller float type, which would cause a large floating-point error in this case.

Get Index of data point in training set with shortest distance to input matrix with Numpy

I would like to build a function npbatch(U,X) which compares the data points in an input matrix (U) with the data points in a training matrix (X) and gives me, for each data point in U, the index of the data point in X with the shortest Euclidean distance.
I would like to avoid any loops to increase the performance and I would like to use the function scipy.spatial.distance.cdist to compute the distance.
Example Input:
U
array([[0.69646919, 0.28613933, 0.22685145],
[0.55131477, 0.71946897, 0.42310646],
[0.9807642 , 0.68482974, 0.4809319 ]])
X
array([[0.24875591, 0.16306678, 0.78364326],
[0.80852339, 0.62562843, 0.60411363],
[0.8857019 , 0.75911747, 0.18110506]])
--> Expected Output: Array with the three indices of the data points in X which have the shortest distance to the three data points in U.
My overall goal is then to get the label of the corresponding data point using the index I've found. An example label input would be:
Y
array([1, 0, 0])
Thank you for any hint!
With scipy.spatial.distance.cdist you already chose a well-suited function for the task. To get the indices, we just have to apply numpy.argmin along axis 0 (or axis 1 for cdist(U, X)):
ix = numpy.argmin(scipy.spatial.distance.cdist(X, U), 0)
Getting the label is then trivial:
Y[ix]
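Put together with the example arrays from the question, a minimal runnable sketch (npbatch is the function name the question asks for) might look like this:
import numpy as np
from scipy.spatial.distance import cdist

def npbatch(U, X):
    # cdist(X, U) has shape (len(X), len(U)); argmin over axis 0 picks,
    # for each point in U, the index of the nearest point in X
    return np.argmin(cdist(X, U), axis=0)

U = np.array([[0.69646919, 0.28613933, 0.22685145],
              [0.55131477, 0.71946897, 0.42310646],
              [0.9807642 , 0.68482974, 0.4809319 ]])
X = np.array([[0.24875591, 0.16306678, 0.78364326],
              [0.80852339, 0.62562843, 0.60411363],
              [0.8857019 , 0.75911747, 0.18110506]])
Y = np.array([1, 0, 0])

ix = npbatch(U, X)
print(ix)     # index of the nearest training point for each row of U
print(Y[ix])  # the corresponding labels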

Intuition behind the correlation

I'm following this tutorial online from Kaggle and I can't get my head around why .T changes the shape of the matrix. Here is the part I am stuck at:
#saleprice correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
I'm basically troubleshooting the code and tried this:
cm = np.corrcoef(df_train[cols].values)
cm.shape
returns a matrix with shape 1460x1460. But when I input:
cm = np.corrcoef(df_train[cols].values.T)
cm.shape
it returns a matrix with shape 10x10. Does anyone know why it does this? I can't figure it out.
The correlation gives you a normalized representation of the covariance matrix between all the "columns" of the dataframe. For instance, in the case of having only two variables, you'd end up with a matrix of the form:
Rx = [[   1, r_xy],
      [r_yx,    1]]
This is quite an expensive computation, since it involves taking the dot product of each column with the rest, resulting in a correlation coefficient for each combination.
So in matrix notation, since you want to end up with a 10x10 matrix, you need the shapes correctly aligned. In this case you want (10,1460)x(1460,10), which gives you a (10,10) matrix. Hence you need to transpose the 2D array so that it has shape (10,1460) when you feed it to np.corrcoef.
Though you might find it easier to play around with it yourself and see how the actual Pearson correlation is computed:
X = np.random.randint(0,10,(500,2))
print(np.corrcoef(X.T))
array([[1. , 0.04400245],
[0.04400245, 1. ]])
Which is doing the same as:
mean_X = X.mean(axis=0)
std_X = X.std(axis=0)
n, _ = X.shape
print((X.T-mean_X[:,None]).dot(X-mean_X)/(n*std_X**2))
array([[1. , 0.04416552],
[0.04383998, 1. ]])
Note that, as mentioned, the result is a normalized dot product of X with itself, so for each (1,1460)x(1460,1) product you're getting a single number. So X here, just as in your example, has to be transposed so the dimensions are correctly aligned.
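As a side note, the off-diagonal entries above differ slightly from np.corrcoef because the quick sketch divides by std_X**2 (each column's own variance) rather than by the product of the two columns' standard deviations. A symmetric version of the same idea, reusing the X, mean_X, std_X and n defined above, would be:
# normalize by the outer product of the standard deviations;
# this matches np.corrcoef(X.T) up to floating-point round-off
print((X - mean_X).T.dot(X - mean_X) / (n * np.outer(std_X, std_X)))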
From the NumPy documentation of corrcoef:
x : array_like
A 1-D or 2-D array containing multiple variables and observations.
Each row of x represents a variable, and
each column a single observation of all those variables. Also see rowvar below.
Note that each row represents a variable: in the first case you have 1460 rows and 10 columns, and in the second one you have 10 rows with 1460 columns.
So when you transpose your NumPy array, you're basically changing from 1460 variables with 10 values each to 10 variables with 1460 values each.
If you are dealing with pandas, you can just use the built-in .corr() method, which computes the correlation between columns.
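For the tutorial's setup, that might look like the following sketch (assuming the df_train and cols from the question; .corr() treats columns as variables, so no transpose is needed):
# Pearson correlation between the selected columns, returned as a labeled DataFrame
cm = df_train[cols].corr()
print(cm.shape)  # (10, 10)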

A random normally distributed matrix in numpy

I would like to generate a matrix M whose elements M(i,j) are drawn from a standard normal distribution. One trivial way of doing it is:
import numpy as np
A = [ [np.random.normal() for i in range(3)] for j in range(3) ]
A = np.array(A)
print(A)
[[-0.12409887 0.86569787 -1.62461893]
[ 0.30234536 0.47554092 -1.41780764]
[ 0.44443707 -0.76518672 -1.40276347]]
But, I was playing around with numpy and came across another "solution":
import numpy as np
import numpy.matlib as npm
A = np.random.normal(npm.zeros((3, 3)), npm.ones((3, 3)))
print(A)
[[ 1.36542538 -0.40676747 0.51832243]
[ 1.94915748 -0.86427391 -0.47288974]
[ 1.9303462 -1.26666448 -0.50629403]]
I read the documentation for numpy.random.normal, but it doesn't clarify how this function works when arrays are passed instead of single values. I suspected that in the second "solution" I might be drawing from a multivariate normal distribution, but that can't be true because the input arguments both have the same dimensions (the covariance should be a matrix and the mean a vector). I'm not sure what is being generated by the second snippet.
The intended way to do what you want is
A = np.random.normal(0, 1, (3, 3))
The third argument is the optional size parameter, which tells NumPy what shape you want returned (3 by 3 in this case).
Your second way works too, because the documentation states
If size is None (default), a single value is returned if loc and scale are both scalars. Otherwise, np.broadcast(loc, scale).size samples are drawn.
So each element is drawn independently, and there is no multivariate distribution and no correlation.
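As a side note, newer NumPy versions also provide the Generator API, which is the currently recommended way to draw samples; a small sketch equivalent to the size-parameter call above:
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded only for reproducibility
A = rng.standard_normal((3, 3))      # 3x3 matrix of independent N(0, 1) draws
print(A)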

SciPy method eigsh giving nonintuitive results

I tried to use the SciPy function sparse.linalg.eigsh to calculate a few eigenvalues and eigenvectors of a matrix. However, when I print the calculated eigenvectors, their dimension matches the number of eigenvalues I asked for. Shouldn't it give me the actual eigenvectors, whose dimension is the same as that of the original matrix?
My code for reference:
id = np.eye(13)
val, vec = sp.sparse.linalg.eigsh(id, k = 2)
print(vec[1])
Which gives me:
[-0.26158945 0.63952164]
Intuitively it should have a dimension of 13, and it should not contain non-integer values either. Is it just my misinterpretation of the function? If so, is there any other function in Python that can calculate a few eigenvectors (I don't want the full spectrum) of the desired dimensionality?
vec is an array with shape (13, 2).
In [21]: vec
Out[21]:
array([[ 0.36312724, -0.04921923],
[-0.26158945, 0.63952164],
[ 0.41693924, 0.34811192],
[ 0.30068329, -0.11360339],
[-0.05388733, -0.3225355 ],
[ 0.47402124, -0.28180261],
[ 0.50581823, 0.29527393],
[ 0.06687073, 0.19762049],
[ 0.103382 , 0.29724875],
[-0.09819873, 0.00949533],
[ 0.05458907, -0.22466131],
[ 0.15499849, 0.0621803 ],
[ 0.01420219, 0.04509334]])
The eigenvectors are stored in the columns of vec. To see the first eigenvector, use vec[:, 0]. When you printed vec[1] (which is equivalent to vec[1, :]), you printed the second row of vec, which is just the second component of each of the two eigenvectors.
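A short sketch of the indexing (same setup as the question, with the matrix renamed to avoid shadowing the built-in id):
import numpy as np
from scipy.sparse.linalg import eigsh

mat = np.eye(13)
val, vec = eigsh(mat, k=2)  # two eigenvalue/eigenvector pairs

print(vec.shape)   # (13, 2): one column per requested eigenvector
print(vec[:, 0])   # the first eigenvector, with 13 components
print(vec[1])      # the second row: the second component of each eigenvector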
