I want to interpolate between values in each row of a matrix (x-values) given a fixed vector of y-values. I am using python and essentially I need something like scipy.interpolate.interp1d but with x values being a matrix input. I implemented this by looping, but I want to make the operation as fast as possible.
Edit
Below is an example of a code of what I am doing right now, note that my matrix has more rows on order of millions:
import numpy as np
x = np.linspace(0,1,100).reshape(10,10)
results = np.zeros(10)
for i in range(10):
results[i] = np.interp(0.1,x[i],range(10))
As #Joe Kington suggested you can use map_coordinates:
import scipy.ndimage as nd
# your data - make sure is float/double
X = np.arange(100).reshape(10,10).astype(float)
# the points where you want to interpolate each row
y = np.random.rand(10) * (X.shape[1]-1)
# the rows at which you want the data interpolated -- all rows
r = np.arange(X.shape[0])
result = nd.map_coordinates(X, [r, y], order=1, mode='nearest')
The above, for the following y:
array([ 8.00091648, 0.46124587, 7.03994936, 1.26307275, 1.51068952,
5.2981205 , 7.43509764, 7.15198457, 5.43442468, 0.79034372])
Note, each value indicates the position in which the value is going to be interpolated for each row.
Gives the following result:
array([ 8.00091648, 10.46124587, 27.03994936, 31.26307275,
41.51068952, 55.2981205 , 67.43509764, 77.15198457,
85.43442468, 90.79034372])
which makes sense considering the nature of the aranged data, and the columns (y) at which it is interpolated.
Related
I'm have a 3D problem where to final output is an array in the xy plane. I have an array in the x-z plane (dimensions (xsiz, zsiz)) and an array in the y-plane (dimension ysiz) as below:
xz = np.zeros((xsiz, zsiz))
y = (np.arange(ysiz)*(zsiz/ysiz)).astype(int)
xz can be thought of as an array of (zsiz) column vectors of size (xsiz) and labelled by z in range (0, zsiz-1). These are not conveniently accessible given the current setup - I've been retrieving them by np.transpose(xz)[z]. I would like the y array to act like a list of z values and take the column vectors labelled by these z values and combine them in a matrix with final dimension (xsiz, ysiz). (It seems likely to me that it will be easier to work with the transpose of xz so the row vectors can be retrieved as above and combined giving a (ysiz, xsiz) matrix which can then be transposed but I may be wrong.)
This would be a simple using for loops and I've given an example of a such a loop that does what I want below in case my explanation isn't clear. However, the final intention is for this code to be parallelized using CuPy so ideally I would like the entire process to be carried out by matrix manipulation. It seems like it should be possible like this but I can't think how!
Any help greatly appreciated.
import numpy as np
xsiz = 5 #sizes given random values for example
ysiz = 6
zsiz = 4
xz = np.arange(xsiz*zsiz).reshape(xsiz, zsiz)
y = (np.arange(ysiz)*(zsiz/ysiz)).astype(int)
xzT = np.transpose(xz)
final_xyT = np.zeros((0, xsiz))
for i in range(ysiz):
index = y[i]
xvec = xzT[index]
final_xyT = np.vstack((final_xyT, xvec))
#indexing could go wrong here if y contained large numbers
#CuPy's indexing wraps around so hopefully this shouldn't be too big an issue
final_xy = np.transpose(final_xyT)
print(xz)
print(final_xy)
If I correctly get your problem you need this:
xz[:,y]
I'm following this tutorial online from kaggle and I can't get my head round why .T is changing the shape of the matrix. Here is the part I am stuck at:
#saleprice correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
I'm basically trouble shooting the code and tried this:
cm = np.corrcoef(df_train[cols].values)
cm.shape
returns a matrix with shape 1460x1460. But when I input:
cm = np.corrcoef(df_train[cols].values.T)
cm.shape
it returns a matrix with shape 10x10. Does anyone know why it does this? I can't figure out.
The correlation gives you a normalized representation of the covariance matrix between all the "columns" of the dataframe. For instance, in the case of having only two variables, you'd end up with a matrix of the shape:
Rx = [[ 1, r_xy],
[r_yx, 1]]
This is quite an expensive computation, since it involves taking the dot product of each column with the rest, resulting in a correlation coefficient for each combination.
So in matrix notation, since you want to end up with a 10x10 matrix, you want to have the shapes correctly aligned. In this case you want (10,1460)x(1460,10) so you get a 10,10 matrix. Hence you need to transpose the 2D-array so that it has shape (10,1460) when you feed it to np.corrcoef.
Though you might find it a little easier by playing around with it yourself and seeing how the actual Pearson correlation is computed:
X = np.random.randint(0,10,(500,2))
print(np.corrcoef(X.T))
array([[1. , 0.04400245],
[0.04400245, 1. ]])
Which is doing the same as:
mean_X = X.mean(axis=0)
std_X = X.std(axis=0)
n, _ = X.shape
print((X.T-mean_X[:,None]).dot(X-mean_X)/(n*std_X**2))
array([[1. , 0.04416552],
[0.04383998, 1. ]])
Note that as mentioned, this is giving as result a normalized dot product of X with itself, so for each (1,1460)x(1460,1) product your getting a single number. So X here, just as in your example, has to be transposed so the dimensions are correctly aligned.
From numpy documentation of corrcoef:
x : array_like
A 1-D or 2-D array containing multiple variables and observations.
Each row of x represents a variable, and
each column a single observation of all those variables. Also see rowvar below.
Note that each row represents a variable, in the first case you have 1460 rows and 10 columns and in the second one you have 10 rows with 1460 columns.
So when you transpose your NumPy array your basically changing from 1460 variables with 10 values for each one to 10 variables with 1460 values for each one.
If you are dealing with pandas you could just use the built-in .corr() method that computes the correlation between columns.
I saw in tutorial (there were no further explanation) that we can process data to zero mean with x -= np.mean(x, axis=0) and normalize data with x /= np.std(x, axis=0). Can anyone elaborate on these two pieces on code, only thing I got from documentations is that np.mean calculates arithmetic mean calculates mean along specific axis and np.std does so for standard deviation.
This is also called zscore.
SciPy has a utility for it:
>>> from scipy import stats
>>> stats.zscore([ 0.7972, 0.0767, 0.4383, 0.7866, 0.8091,
... 0.1954, 0.6307, 0.6599, 0.1065, 0.0508])
array([ 1.1273, -1.247 , -0.0552, 1.0923, 1.1664, -0.8559, 0.5786,
0.6748, -1.1488, -1.3324])
Follow the comments in the code below
import numpy as np
# create x
x = np.asarray([1,2,3,4], dtype=np.float64)
np.mean(x) # calculates the mean of the array x
x-np.mean(x) # this is euivalent to subtracting the mean of x from each value in x
x-=np.mean(x) # the -= means can be read as x = x- np.mean(x)
np.std(x) # this calcualtes the standard deviation of the array
x/=np.std(x) # the /= means can be read as x = x/np.std(x)
From the given syntax you have I conclude, that your array is multidimensional. Hence I will first discuss the case where your x is just a linear array:
np.mean(x) will compute the mean, by broadcasting x-np.mean(x) the mean of x will be subtracted form all the entries. x -=np.mean(x,axis = 0) is equivalent to x = x-np.mean(x,axis = 0). Similar for x/np.std(x).
In the case of multidimensional arrays the same thing happens, but instead of computing the mean over the entire array, you just compute the mean over the first "axis". Axis is the numpy word for dimension. So if your x is two dimensional, then np.mean(x,axis =0) = [np.mean(x[:,0], np.mean(x[:,1])...]. Broadcasting again will ensure, that this is done to all elements.
Note, that this only works with the first dimension, otherwise the shapes will not match for broadcasting. If you want to normalize wrt another axis you need to do something like:
x -= np.expand_dims(np.mean(x, axis = n), n)
Key here are the assignment operators. They actually performs some operations on the original variable.
a += c is actually equal to a=a+c.
So indeed a (in your case x) has to be defined beforehand.
Each method takes an array/iterable (x) as input and outputs a value (or array if a multidimensional array was input), which is thus applied in your assignment operations.
The axis parameter means that you apply the mean or std operation over the rows. Hence, you take values for each row in a given column and perform the mean or std.
Axis=1 would take values of each column for a given row.
What you do with both operations is that first you remove the mean so that your column mean is now centered around 0. Then, when you divide by std, you happen to reduce the spread of the data around this zero, and now it should roughly be in a [-1, +1] interval around 0.
So now, each of your column values is centered around zero and standardized.
There are other scaling techniques, such as removing the minimal or maximal value and dividing by the range of values.
Here is my problem : I manipulate 432*46*136*136 grids representing time*(space) encompassed in numpy arrays with numpy and python. I have one array alt, which encompasses the altitudes of the grid points, and another array temp which stores the temperature of the grid points.
It is problematic for a comparison : if T1 and T2 are two results, T1[t0,z0,x0,y0] and T2[t0,z0,x0,y0] represent the temperature at H1[t0,z0,x0,y0] and H2[t0,z0,x0,y0] meters, respectively. But I want to compare the temperature of points at the same altitude, not at the same grid point.
Hence I want to modify the z-axis of my matrices to represent the altitude and not the grid point. I create a function conv(alt[t,z,x,y]) which attributes a number between -20 and 200 to each altitude. Here is my code :
def interpolation_extended(self,temp,alt):
[t,z,x,y]=temp.shape
new=np.zeros([t,220,x,y])
for l in range(0,t):
for j in range(0,z):
for lat in range(0,x):
for lon in range(0,y):
new[l,conv(alt[l,j,lat,lon]),lat,lon]=temp[l,j,lat,lon]
return new
But this takes definitely too much time, I can't work this it. I tried to write it using universal functions with numpy :
def interpolation_extended(self,temp,alt):
[t,z,x,y]=temp.shape
new=np.zeros([t,220,x,y])
for j in range(0,z):
new[:,conv(alt[:,j,:,:]),:,:]=temp[:,j,:,:]
return new
But that does not work. Do you have any idea of doing this in python/numpy without using 4 nested loops ?
Thank you
I can't really try the code since I don't have your matrices, but something like this should do the job.
First, instead of declaring conv as a function, get the whole altitude projection for all your data:
conv = np.round(alt / 500.).astype(int)
Using np.round, the numpys version of round, it rounds all the elements of the matrix by vectorizing operations in C, and thus, you get a new array very quickly (at C speed). The following line aligns the altitudes to start in 0, by shifting all the array by its minimum value (in your case, -20):
conv -= conv.min()
the line above would transform your altitude matrix from [-20, 200] to [0, 220] (better for indexing).
With that, interpolation can be done easily by getting multidimensional indices:
t, z, y, x = np.indices(temp.shape)
the vectors above contain all the indices needed to index your original matrix. You can then create the new matrix by doing:
new_matrix[t, conv[t, z, y, x], y, x] = temp[t, z, y, x]
without any loop at all.
Let me know if it works. It might give you some erros since is hard for me to test it without data, but it should do the job.
The following toy example works fine:
A = np.random.randn(3,4,5) # Random 3x4x5 matrix -- your temp matrix
B = np.random.randint(0, 10, 3*4*5).reshape(3,4,5) # your conv matrix with altitudes from 0 to 9
C = np.zeros((3,10,5)) # your new matrix
z, y, x = np.indices(A.shape)
C[z, B[z, y, x], x] = A[z, y, x]
C contains your results by altitude.
Suppose I have two vectors of length 25, and I want to compute their covariance matrix. I try doing this with numpy.cov, but always end up with a 2x2 matrix.
>>> import numpy as np
>>> x=np.random.normal(size=25)
>>> y=np.random.normal(size=25)
>>> np.cov(x,y)
array([[ 0.77568388, 0.15568432],
[ 0.15568432, 0.73839014]])
Using the rowvar flag doesn't help either - I get exactly the same result.
>>> np.cov(x,y,rowvar=0)
array([[ 0.77568388, 0.15568432],
[ 0.15568432, 0.73839014]])
How can I get the 25x25 covariance matrix?
You have two vectors, not 25. The computer I'm on doesn't have python so I can't test this, but try:
z = zip(x,y)
np.cov(z)
Of course.... really what you want is probably more like:
n=100 # number of points in each vector
num_vects=25
vals=[]
for _ in range(num_vects):
vals.append(np.random.normal(size=n))
np.cov(vals)
This takes the covariance (I think/hope) of num_vects 1xn vectors
Try this:
import numpy as np
x=np.random.normal(size=25)
y=np.random.normal(size=25)
z = np.vstack((x, y))
c = np.cov(z.T)
Covariance matrix from samples vectors
To clarify the small confusion regarding what is a covariance matrix defined using two N-dimensional vectors, there are two possibilities.
The question you have to ask yourself is whether you consider:
each vector as N realizations/samples of one single variable (for example two 3-dimensional vectors [X1,X2,X3] and [Y1,Y2,Y3], where you have 3 realizations for the variables X and Y respectively)
each vector as 1 realization for N variables (for example two 3-dimensional vectors [X1,Y1,Z1] and [X2,Y2,Z2], where you have 1 realization for the variables X,Y and Z per vector)
Since a covariance matrix is intuitively defined as a variance based on two different variables:
in the first case, you have 2 variables, N example values for each, so you end up with a 2x2 matrix where the covariances are computed thanks to N samples per variable
in the second case, you have N variables, 2 samples for each, so you end up with a NxN matrix
About the actual question, using numpy
if you consider that you have 25 variables per vector (took 3 instead of 25 to simplify example code), so one realization for several variables in one vector, use rowvar=0
# [X1,Y1,Z1]
X_realization1 = [1,2,3]
# [X2,Y2,Z2]
X_realization2 = [2,1,8]
numpy.cov([X,Y],rowvar=0) # rowvar false, each column is a variable
Code returns, considering 3 variables:
array([[ 0.5, -0.5, 2.5],
[-0.5, 0.5, -2.5],
[ 2.5, -2.5, 12.5]])
otherwise, if you consider that one vector is 25 samples for one variable, use rowvar=1 (numpy's default parameter)
# [X1,X2,X3]
X = [1,2,3]
# [Y1,Y2,Y3]
Y = [2,1,8]
numpy.cov([X,Y],rowvar=1) # rowvar true (default), each row is a variable
Code returns, considering 2 variables:
array([[ 1. , 3. ],
[ 3. , 14.33333333]])
Reading the documentation as,
>> np.cov.__doc__
or looking at Numpy Covariance, Numpy treats each row of array as a separate variable, so you have two variables and hence you get a 2 x 2 covariance matrix.
I think the previous post has right solution. I have the explanation :-)
I suppose what youre looking for is actually a covariance function which is a timelag function. I'm doing autocovariance like that:
def autocovariance(Xi, N, k):
Xs=np.average(Xi)
aCov = 0.0
for i in np.arange(0, N-k):
aCov = (Xi[(i+k)]-Xs)*(Xi[i]-Xs)+aCov
return (1./(N))*aCov
autocov[i]=(autocovariance(My_wector, N, h))
You should change
np.cov(x,y, rowvar=0)
onto
np.cov((x,y), rowvar=0)
What you got (2 by 2) is more useful than 25*25. Covariance of X and Y is an off-diagonal entry in the symmetric cov_matrix.
If you insist on (25 by 25) which I think useless, then why don't you write out the definition?
x=np.random.normal(size=25).reshape(25,1) # to make it 2d array.
y=np.random.normal(size=25).reshape(25,1)
cov = np.matmul(x-np.mean(x), (y-np.mean(y)).T) / len(x)
As pointed out above, you only have two vectors so you'll only get a 2x2 cov matrix.
IIRC the 2 main diagonal terms will be sum( (x-mean(x))**2) / (n-1) and similarly for y.
The 2 off-diagonal terms will be sum( (x-mean(x))(y-mean(y)) ) / (n-1). n=25 in this case.
according the document, you should expect variable vector in column:
If we examine N-dimensional samples, X = [x1, x2, ..., xn]^T
though later it says each row is a variable
Each row of m represents a variable.
so you need input your matrix as transpose
x=np.random.normal(size=25)
y=np.random.normal(size=25)
X = np.array([x,y])
np.cov(X.T)
and according to wikipedia: https://en.wikipedia.org/wiki/Covariance_matrix
X is column vector variable
X = [X1,X2, ..., Xn]^T
COV = E[X * X^T] - μx * μx^T // μx = E[X]
you can implement it yourself:
# X each row is variable
X = X - X.mean(axis=0)
h,w = X.shape
COV = X.T # X / (h-1)
i don't think you understand the definition of covariance matrix.
If you need 25 x 25 covariance matrix, you need 25 vectors each with n data points.