I would like to create a Python function to linearly interpolate within a partly empty grid and get nearest-neighbor extrapolation out of bounds.
Let's say I have the following data stored in a pandas DataFrame:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: x = [0,1,2,3,4]
In [4]: y = [0.5,1.5,2.5,3.5,4.5,5.5]
In [5]: z = np.array([[np.nan,np.nan,1.5,2.0,5.5,3.5],[np.nan,1.0,4.0,2.5,4.5,3.0],[2.0,0.5,6.0,1.5,3.5,np.nan],[np.nan,1.5,4.0,2.0,np.nan,np.nan],[np.nan,np.nan,2.0,np.nan,np.nan,np.nan]])
In [6]: df = pd.DataFrame(z,index=x,columns=y)
In [7]: df
Out[7]:
0.5 1.5 2.5 3.5 4.5 5.5
0 NaN NaN 1.5 2.0 5.5 3.5
1 NaN 1.0 4.0 2.5 4.5 3.0
2 2.0 0.5 6.0 1.5 3.5 NaN
3 NaN 1.5 4.0 2.0 NaN NaN
4 NaN NaN 2.0 NaN NaN NaN
I would like a function myInterp that returns a linear interpolation within the data boundaries (i.e. where values are not NaN) and nearest-neighbor extrapolation outside those bounds (i.e. where values are NaN or missing), such as:
In [1]: myInterp([1.5,2.5]) #linear interpolation
Out[1]: 5.0
In [2]: myInterp([1.5,4.0]) #bi-linear interpolation
Out[2]: 3.0
In [3]: myInterp([0.0,2.0]) #nearest extrapolation (inside grid)
Out[3]: 1.5
In [4]: myInterp([5.0,2.5]) #nearest extrapolation (outside grid)
Out[4]: 2.0
I tried many combinations of the scipy.interpolate package with no success; does anyone have a suggestion how to do it?
Yes, unfortunately scipy doesn't deal with NaNs.
From the docs:
Note that calling interp2d with NaNs present in input values results in undefined behaviour.
Even masking the NaNs in an np.ma.masked_array was not successful.
So my advice is to remove all the NaN entries from z, taking the opportunity to give sp.interpolate.interp2d the full list of x- and y-coordinates for only the valid data, and keeping Z 1-D as well:
X = []; Y = []; Z = []              # initialize new 1-D lists for interp2d
for i, xi in enumerate(x):          # iterate through x
    for k, yk in enumerate(y):      # iterate through y
        if not np.isnan(z[i, k]):   # check if z-value is valid...
            X.append(xi)            # ...and if so, append coordinates and value to the prepared lists
            Y.append(yk)
            Z.append(z[i, k])
This way at least sp.interp2d works and gives a result:
ip = sp.interpolate.interp2d(X,Y,Z)
However, the values in the result won't please you:
In: ip(x,y)
Out:
array([[ 18.03583061, -0.44933642, 0.83333333, -1. , -1.46105542],
[ 9.76791531, 1.3014037 , 2.83333333, 1.5 , 0.26947229],
[ 1.5 , 3.05214381, 4.83333333, 4. , 2. ],
[ 2. , 3.78378051, 1.5 , 2. , 0.8364618 ],
[ 5.5 , 3.57039277, 3.5 , -0.83019815, -0.7967441 ],
[ 3.5 , 3.29227922, 17.29607177, 0. , 0. ]])
compared to the input data:
In:z
Out:
array([[ nan, nan, 1.5, 2. , 5.5, 3.5],
[ nan, 1. , 4. , 2.5, 4.5, 3. ],
[ 2. , 0.5, 6. , 1.5, 3.5, nan],
[ nan, 1.5, 4. , 2. , nan, nan],
[ nan, nan, 2. , nan, nan, nan]])
But IMHO this is because the gradient changes in your data are far too high, even more so given the low number of data samples.
I hope this is just a test data set and your real application has smoother gradients and some more samples. Then I'd be glad to hear if it works...
However, a trivial test with an array of zero gradient - only disturbed a little by NaNs - gives a hint that interpolation should work, while extrapolation is only partly correct:
In:ip(x,y)
Out:
array([[ 3. , 3. , 3. , 3. , 0. ],
[ 3. , 3. , 3. , 3. , 1.94701008],
[ 3. , 3. , 3. , 3. , 3. ],
[ 3. , 3. , 3. , 3. , 1.54973345],
[ 3. , 3. , 3. , 3. , 0.37706713],
[ 3. , 3. , 2.32108317, 0.75435203, 0. ]])
resulting from the trivial test input
In:z
Out:
array([[ nan, nan, 3., 3., 3., 3.],
[ nan, 3., 3., nan, 3., 3.],
[ 3., 3., 3., 3., 3., nan],
[ nan, 3., 3., 3., nan, nan],
[ nan, nan, 3., nan, nan, nan]])
PS: Looking closer at the right-hand side: even some valid entries are completely changed, i.e. made wrong, which would introduce errors into any subsequent analysis.
But surprise: the cubic version performs much better here:
In:ip = sp.interpolate.interp2d(X,Y,Z, kind='cubic')
In:ip(x,y)
Out:
array([[ 3. , 3. , 3. , 3.02397028, 3.0958811 ],
[ 3. , 3. , 3. , 3. , 3. ],
[ 3. , 3. , 3. , 3. , 3. ],
[ 3. , 3. , 3. , 3. , 3. ],
[ 3. , 3. , 3. , 2.97602972, 2.9041189 ],
[ 3. , 3. , 3. , 2.9041189 , 2.61647559]])
In:z
Out:
array([[ nan, nan, 3., 3., 3., 3.],
[ nan, 3., 3., nan, 3., 3.],
[ 3., 3., 3., 3., 3., nan],
[ nan, 3., 3., 3., nan, nan],
[ nan, nan, 3., nan, nan, nan]])
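As an alternative to feeding the scattered X/Y/Z lists into interp2d, they could also go into scipy.interpolate.griddata, which is designed for scattered data; combining a 'linear' and a 'nearest' call gives linear interpolation inside the convex hull of the valid points and nearest-neighbor values elsewhere. This is just a sketch of that idea (my suggestion, not part of the answer above); note that griddata's linear mode works on a triangulation, so its values can differ slightly from true bilinear grid interpolation:
from scipy.interpolate import griddata

# Evaluate on the full grid built from the original x/y axes.
xi, yi = np.meshgrid(x, y, indexing='ij')
pts = np.column_stack([X, Y])          # scattered coordinates of the valid samples

lin  = griddata(pts, Z, (xi, yi), method='linear')   # NaN outside the convex hull
near = griddata(pts, Z, (xi, yi), method='nearest')  # defined everywhere
result = np.where(np.isnan(lin), near, lin)          # linear inside, nearest outside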
Since scipy.interpolate.interp2d doesn't deal with NaNs, the solution is to fill the NaNs in the DataFrame before using interp2d. This can be done with the pandas DataFrame.interpolate method.
In the previous example, the following provides the desired output:
In [1]: from scipy.interpolate import interp2d
In [2]: df = df.interpolate(limit_direction='both', axis=1)
In [3]: myInterp = interp2d(df.index,df.columns,df.values.T)
In [4]: myInterp(1.5,2.5)
Out[4]: array([5.])
In [5]: myInterp(1.5,4.0)
Out[5]: array([3.])
In [6]: myInterp(0.0,2.0)
Out[6]: array([1.5])
In [7]: myInterp(5.0,2.5)
Out[7]: array([2.])
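For completeness, the fill-then-interpolate idea can be wrapped into a single helper. This is a minimal sketch, assuming the DataFrame from the question: the helper name make_myInterp is made up here, RegularGridInterpolator is used because interp2d is deprecated in recent SciPy releases, and the nearest behavior outside the grid comes from clipping the query point to the grid bounds.
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def make_myInterp(df):
    # Fill the NaN holes first (as above) so the grid is complete.
    filled = df.interpolate(limit_direction='both', axis=1)
    xs = filled.index.to_numpy(dtype=float)    # grid x coordinates
    ys = filled.columns.to_numpy(dtype=float)  # grid y coordinates
    rgi = RegularGridInterpolator((xs, ys), filled.to_numpy(), method='linear')

    def myInterp(point):
        # Clipping to the grid bounds turns out-of-range queries into
        # nearest-edge extrapolation.
        px = np.clip(point[0], xs[0], xs[-1])
        py = np.clip(point[1], ys[0], ys[-1])
        return float(rgi([(px, py)])[0])

    return myInterp

myInterp = make_myInterp(df)
myInterp([1.5, 2.5])   # 5.0
myInterp([5.0, 2.5])   # 2.0 (x clipped to 4)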
my_list=[[ 0., 40. , nan],
[60. , 0. , nan],
[ nan , nan , nan]]
Is it possible to remove the NaN values?
Expected output:
my_list=[[0.,40.],
[60., 0.]]
import numpy as np
x=np.array([[ 0., 40. , np.nan],
[60. , 0. , np.nan],
[ np.nan , np.nan , np.nan]])
x = (x[~np.isnan(x).all(axis=1), :]) # remove rows with nan
x = (x[:, ~np.isnan(x).all(axis=0)]) # remove cols with nan
Output
[[ 0. 40.]
[60. 0.]]
But as @mozway said, if a row or column contains at least one non-NaN value, it will remain in the result.
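For reference, the same all-NaN trimming can also be done with pandas, whose dropna(how='all') drops only rows or columns made entirely of NaN. A small sketch, assuming pandas is acceptable here:
import numpy as np
import pandas as pd

x = np.array([[ 0.,    40.,    np.nan],
              [60.,     0.,    np.nan],
              [np.nan, np.nan, np.nan]])

# how='all' keeps any row/column with at least one valid value,
# matching the numpy version above.
out = (pd.DataFrame(x)
         .dropna(axis=0, how='all')
         .dropna(axis=1, how='all')
         .to_numpy())
print(out)
# [[ 0. 40.]
#  [60.  0.]]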
Given a parameter p, which can be any float or integer; for example, let p = 4. Writing q = (0.5)^(1/p), so that a value a_k placed at time k decays to a_k*q^(t-k) at a later time t, the table is:
time           1     2             3             4             5
Numbers (a1)   a1    a1*q^(2-1)    a1*q^(3-1)    a1*q^(4-1)    a1*q^(5-1)
Numbers (a2)   nan   a2            a2*q^(3-2)    a2*q^(4-2)    a2*q^(5-2)
Numbers (a3)   nan   nan           a3            a3*q^(4-3)    a3*q^(5-3)
Numbers (a4)   nan   nan           nan           a4            a4*q^(5-4)
Numbers (a5)   nan   nan           nan           nan           a5
Final Results  a1    sum of col 2  sum of col 3  sum of col 4  sum of col 5
Numbers a1, a2, a3, a4, a5, ..., at are given; the goal is to compute the Final Results row. Combining this with the answer provided by mozway, I wrote the following function, which works well. It is a matrix-based way to solve the problem.
def hl(p, column):
    a = np.arange(len(copy_raw))
    factors = a[:, None] - a
    factors = np.where(factors < 0, np.nan, factors)
    inter = ((1/2)**(1/p))**factors
    copy_raw[column] = np.nansum(copy_raw[column].to_numpy() * inter, axis=1)
However, I don't think this method will work well if we are dealing with a large DataFrame. Is there a better way to solve the problem? (In this case, faster = better.)
Assuming your number of rows is not too large, you can achieve this with numpy broadcasting:
First create a 2D array of factors:
a = np.arange(len(df))
factors = (a[:,None]-a)
factors = np.where(factors<0, np.nan, factors)
# array([[ 0., nan, nan, nan, nan],
# [ 1., 0., nan, nan, nan],
# [ 2., 1., 0., nan, nan],
# [ 3., 2., 1., 0., nan],
# [ 4., 3., 2., 1., 0.]])
Then map to your data and sum:
df['number2'] = np.nansum(df['number'].to_numpy()*(1/2)**factors, axis=1)
example output:
Index Time number number2
0 0 1997-WK01 1 1.0000
1 1 1997-WK02 2 2.5000
2 2 1997-WK03 3 4.2500
3 3 1997-WK04 2 4.1250
4 4 1997-WK05 4 6.0625
intermediate:
df['number'].to_numpy()*(1/2)**factors
# array([[1. , nan, nan, nan, nan],
# [0.5 , 2. , nan, nan, nan],
# [0.25 , 1. , 3. , nan, nan],
# [0.125 , 0.5 , 1.5 , 2. , nan],
# [0.0625, 0.25 , 0.75 , 1. , 4. ]])
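If the frame gets long, the n x n factor matrix becomes expensive. Since each column sum satisfies S_t = a_t + q * S_{t-1} with q = (1/2)**(1/p), the same result can be computed in O(n) with a linear recursive filter. A sketch of that idea (the helper name hl_fast and the use of scipy.signal.lfilter are my own choices, not part of the answer above; it assumes the input column contains no NaN):
import numpy as np
import pandas as pd
from scipy.signal import lfilter

def hl_fast(numbers: pd.Series, p: float) -> pd.Series:
    # y[t] = x[t] + q * y[t-1]  <=>  lfilter with b = [1], a = [1, -q]
    q = (1 / 2) ** (1 / p)
    y = lfilter([1.0], [1.0, -q], numbers.to_numpy(dtype=float))
    return pd.Series(y, index=numbers.index)

# With p = 1 (i.e. q = 0.5) this reproduces the broadcasting result above:
# hl_fast(df['number'], 1)  ->  1.0, 2.5, 4.25, 4.125, 6.0625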
Let's say I have data for 3 variables, A, B, and C (in my actual application the number of variables is anywhere from 1000-3000 but could be even higher).
Let's also say that there are pieces of the data that come in arrays.
For example:
Array X:
np.array([[ 0., 2., 3.],
[ -2., 0., 4.],
[ -3., -4., 0.]])
Where:
X[0,0] corresponds to data for variables A and A
X[0,1] corresponds to data for variables A and B
X[0,2] corresponds to data for variables A and C
X[1,0] corresponds to data for variables B and A
X[1,1] corresponds to data for variables B and B
X[1,2] corresponds to data for variables B and C
X[2,0] corresponds to data for variables C and A
X[2,1] corresponds to data for variables C and B
X[2,2] corresponds to data for variables C and C
Array Y:
np.array([[2,12],
[-12, 2]])
Y[0,0] corresponds to data for variables A and C
Y[0,1] corresponds to data for variables A and B
Y[1,0] corresponds to data for variables B and A
Y[1,1] corresponds to data for variables C and A
Array Z:
np.array([[ 99, 77],
[-77, -99]])
Z[0,0] corresponds to data for variables A and C
Z[0,1] corresponds to data for variables B and C
Z[1,0] corresponds to data for variables C and B
Z[1,1] corresponds to data for variables C and A
I want to concatenate the above arrays keeping the variable position fixed as follows:
END_RESULT_ARRAY index 0 corresponds to variable A
END_RESULT_ARRAY index 1 corresponds to variable B
END_RESULT_ARRAY index 2 corresponds to variable C
Basically, there are N variables in the universe, but the set can change every month (new ones can be introduced and existing ones can drop out and then return, or never return). Within the N variables in the universe I compute permutation pairs, and the position of each variable is fixed, i.e. index 0 corresponds to variable A and index 1 corresponds to variable B (as described above).
Given the above requirement the end END_RESULT_ARRAY should look like the following:
array([[[ 0., 2., 3.],
[ -2., 0., 4.],
[ -3., -4., 0.]],
[[ nan, 12., 2.],
[-12., nan, nan],
[ 2., nan, nan]],
[[ nan, nan, 99.],
[ nan, nan, 77.],
[-99., -77., nan]]])
Keep in mind that the above is an illustration.
In my actual application, I have about 125 arrays and a new one is generated every month. Each monthly array may have different sizes and may only have data for a portion of the variables defined in my universe. Also, as new arrays are created each month there is no way of knowing what its size will be or which variables will have data (or which ones will be missing).
So up until the most recent monthly array, we can determine the max size from the available historical data. Each month we will have to re-check the max size of all the arrays as a new array comes available. Once we have the max size we can then re-stitch/concatenate all the arrays together IF THIS IS SOMETHING THAT IS DOABLE in numpy. This will be an on-going operation done every month.
I want a general mechanism to be able to stitch these arrays together keeping the requirements I describe regarding the index position for the variables fixed.
I actually want to use h5py arrays, as my data set will grow exponentially in the not-too-distant future. However, I would like to get this working with numpy as a first step.
Based on the comment made by @user3483203, the next step is to concatenate the arrays.
a = np.array([[ 0.,  2.,  3.],
              [-2.,  0.,  4.],
              [-3., -4.,  0.]])
b = np.array([[0, 12], [-12, 0]])

out = np.full_like(a, np.nan)   # NaN frame with a's shape
i, j = b.shape
out[:i, :j] = b                 # place b in the top-left corner

res = np.array([a, out])
print(res)
This answers the original question which has since been changed:
Let's say I have the following arrays:
np.array([[ 0., 2., 3.],
[ -2., 0., 4.],
[ -3., -4., 0.]])
np.array([[0,12],
[-12, 0]])
I want to concatenate the above 2 arrays such that the end result is
as follows:
array([[[0, 2, 3],
[-2, 0, 4],
[-3,-4, 0]],
[[0,12, np.nan],
[-12, 0, np.nan],
[np.nan, np.nan, np.nan]]])
Find out how much each array falls short of the max size in each dimension, then use np.pad to pad at the end of each dimension, and finally np.stack to stack them together:
import numpy as np

a = np.arange(12).reshape(4, 3).astype(float)
b = np.arange(4).reshape(1, 4).astype(float)
arrs = (a, b)

dims = len(arrs[0].shape)
maxshape = tuple(max(x.shape[i] for x in arrs) for i in range(dims))
paddedarrs = [np.pad(x, tuple((0, maxshape[i] - x.shape[i]) for i in range(dims)),
                     'constant', constant_values=(np.nan,)) for x in arrs]
c = np.stack(paddedarrs, 0)
print (a)
print(b,"\n======================")
print(c)
[[ 0. 1. 2.]
[ 3. 4. 5.]
[ 6. 7. 8.]
[ 9. 10. 11.]]
[[0. 1. 2. 3.]]
======================
[[[ 0. 1. 2. nan]
[ 3. 4. 5. nan]
[ 6. 7. 8. nan]
[ 9. 10. 11. nan]]
[[ 0. 1. 2. 3.]
[nan nan nan nan]
[nan nan nan nan]
[nan nan nan nan]]]
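The padding above aligns everything at the top-left corner; to keep each value in the slot of its variable pair, as the question requires, each monthly array also needs the variable labels of its cells so they can be scattered into a full N x N frame. A rough sketch of that idea, where the label arrays are my own illustration (the question only describes them in prose):
import numpy as np

universe = ['A', 'B', 'C']                    # fixed master ordering
pos = {v: i for i, v in enumerate(universe)}

def place(values, labels, n):
    # 'labels' holds one (row_var, col_var) pair per cell of 'values';
    # each value is written into the master slot of its pair.
    out = np.full((n, n), np.nan)
    for (r_var, c_var), val in zip(labels.reshape(-1, 2), np.ravel(values)):
        out[pos[r_var], pos[c_var]] = val
    return out

# Array Y from the question, with the cell labels it describes:
Y = np.array([[2, 12], [-12, 2]])
Y_labels = np.array([[['A', 'C'], ['A', 'B']],
                     [['B', 'A'], ['C', 'A']]])
print(place(Y, Y_labels, len(universe)))
# [[ nan  12.   2.]
#  [-12.  nan  nan]
#  [  2.  nan  nan]]
# np.stack of the per-month layers then gives the 3-D END_RESULT_ARRAY.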
According to this we can get labels for non-singleton clusters.
I tried this with a simple example.
import numpy as np
import scipy.cluster.hierarchy
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
mat = np.array([[ 0. , 1. , 3. ,0. ,2. ,3. ,1.],
[ 1. , 0. , 3. , 1., 1. , 2. , 2.],
[ 3., 3. , 0., 3. , 3., 3. , 4.],
[ 0. , 1. , 3., 0. , 2. , 3., 1.],
[ 2. , 1., 3. , 2., 0. , 1., 3.],
[ 3. , 2., 3. , 3. , 1. , 0. , 3.],
[ 1. , 2., 4. , 1. , 3., 3. , 0.]])
def llf(id):
    if id < n:
        return str(id)
    else:
        return '[%d %d %1.2f]' % (id, count, R[n-id, 3])

linkage_matrix = linkage(mat, "complete")
dendrogram(linkage_matrix,
           p=4,
           leaf_label_func=llf,
           color_threshold=1,
           truncate_mode='lastp',
           distance_sort='ascending')
plt.show()
What are n and count here? In a diagram like the following, I need to know which observations are listed under (3) and (2).
I think the documentation is not very clear on this part, and the sample code in it is not even operational. But it is clear that 1 means the 2nd observation and (3) means there are 3 observations in that node.
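To make the docs snippet operational: n is the number of original observations, R is the linkage matrix, and count is the number of observations inside the non-singleton cluster; for a merged cluster with id >= n, that count sits in column 3 of linkage-matrix row id - n (note id - n, not n - id as in the docs snippet). A working variant, following my reading of the docs, might look like this:
n = mat.shape[0]  # number of original observations (7 here)

def llf(id):
    # Leaves with id < n are original observations; anything else is a
    # merged cluster described by linkage_matrix[id - n], whose observation
    # count is in column 3 and whose merge distance is in column 2.
    if id < n:
        return str(id)
    count = int(linkage_matrix[id - n, 3])
    dist = linkage_matrix[id - n, 2]
    return '(%d) d=%.2f' % (count, dist)

dendrogram(linkage_matrix,
           p=4,
           leaf_label_func=llf,
           color_threshold=1,
           truncate_mode='lastp',
           distance_sort='ascending')
plt.show()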
If what you want to know is which 3 observations are in the (3) node:
In [51]:
D4 = dendrogram(linkage_matrix,
                color_threshold=1,
                p=4,
                truncate_mode='lastp',
                distance_sort='ascending')
D7 = dendrogram(linkage_matrix,
                color_list=['g',]*7,
                p=7,
                truncate_mode='lastp',
                distance_sort='ascending', no_plot=True)
from itertools import groupby
[list(group) for key, group in groupby(D7['ivl'],lambda x: x in D4['ivl'])]
Out[51]:
[['1'], ['6', '0', '3'], ['2'], ['4', '5']]
So the (3) node contains the 7th, 1st, and 4th observations (indices 6, 0, 3), and the (2) node contains the 5th and 6th observations (indices 4, 5).