computing with NaNs using numpy's ma module - python

I do not understand the behavior of numpy.ma.max (and min, mean, etc.):
import numpy as np
arr = np.ma.array([0,np.nan,1])
np.ma.max(arr)
-> nan
I thought this was supposed to return a value excluding NaNs? The only way I can get a real value is:
np.nanmax(np.asarray(arr))
Is this right, or am I using numpy.ma.max incorrectly?

You need to create the mask:
import numpy as np
arr = np.ma.array([0,np.nan,1])
print(np.ma.max(arr))
# >>>nan # since there is no mask
marr = np.ma.masked_array([0,np.nan,1], np.isnan(arr))
print(np.ma.max(marr))
# >>>1.0 # since the mask tells max to ignore the nan. The max of the rest (0, 1) is 1.

A straightforward way to create the mask is to use the np.ma.masked_invalid function (see http://docs.scipy.org/doc/numpy/reference/generated/numpy.ma.masked_invalid.html#numpy.ma.masked_invalid).
Here is an example:
# Makes example reproducible
np.random.seed(seed=1337)
# Generate some data
X = np.random.random((5,5))
X[X > .5] = np.nan
print(X)
array([[ 0.26202468,  0.15868397,  0.27812652,  0.45931689,  0.32100054],
       [        nan,  0.26194293,         nan,         nan,  0.11527423],
       [ 0.38627507,         nan,  0.12505793,         nan,  0.44322487],
       [        nan,         nan,  0.36126157,  0.41610394,         nan],
       [        nan,  0.18780841,  0.28816715,         nan,  0.49964826]])
# Mask will hide both np.nan and np.inf values
masked_X = np.ma.masked_invalid(X, copy=False)
# Voila
print(np.max(masked_X, axis=0))
masked_array(data = [0.38627506863435945 0.26194292556514465 0.36126157241743073
                     0.45931688721456665 0.49964826137201246],
             mask = [False False False False False],
             fill_value = 1e+20)
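Coming back to the original question: once the invalid values are masked, np.ma.max behaves like np.nanmax. A minimal check (my own sketch, not part of the original answer):
import numpy as np

arr = np.array([0, np.nan, 1])
print(np.ma.masked_invalid(arr).max())  # 1.0
print(np.nanmax(arr))                   # 1.0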

Related

Numba JIT changing results if values are printed

I started working with numba today, mainly because I have a nested for-loop that can take quite a while with regular python code.
I have a MacPorts version of python-2.7 with llvm-3.6 and the pip version of numba (everything is up-to-date).
Here is the code I'm using:
import pandas as pd
from numba import jit
from numpy import nan, full, float64

@jit
def movingAverage(adj_close, maxMA):
    ma = full([len(adj_close), maxMA], nan, dtype=float64)
    ind = range(1, len(adj_close)+1)
    for d in ind:
        m = max(0, d-maxMA-1)
        adj = adj_close[d-1:m:-1] if (m or d == maxMA+1) else adj_close[d-1::-1]
        cs = adj.cumsum()
        for i in range(len(adj)):
            ma[d-1][i] = (cs[i] / (i+1))
    print ma
    return ma
I'm calculating a rolling mean for the input adj_close for up to maxMA days.
adj_close is an array of values, one value per day.
I started by creating ma, a holder for the values that are going to be calculated, and worked out the values for each day individually (note that the first day can only have an average involving 1 day, the second 2, and so on up to maxMA).
If I input something like adj_close = array(range(5), dtype=float64) and maxMA = 3, I get the right answer as follows:
array([[ 0. ,  nan,  nan],
       [ 1. ,  0.5,  nan],
       [ 2. ,  1.5,  1. ],
       [ 3. ,  2.5,  2. ],
       [ 4. ,  3.5,  3. ]])
However, if I take out the print ma line just before the return of my function, it returns only part of the answer:
array([[ nan,  nan,  nan],
       [ nan,  nan,  nan],
       [ nan,  nan,  nan],
       [ 3. ,  2.5,  2. ],
       [ 4. ,  3.5,  3. ]])
Why is that happening? Why does @jit need the print between those loops to get the answer right? What can I do to get rid of the print statement (which greatly increases the runtime)?
Edit: I'm accepting @JoshAdel's suggestion and opened an issue at Numba's github. I'm, therefore, accepting @MSeifert's answer, as the workaround solved the problem for me.
I think numba does something strange here, probably because of the mixture of python and nopython mode. If I use Python 3.5 the results are identical with and without print.
For python 2.7 I think the problem is that the for-loop is either compiled in nopython mode (without print) or in python mode (with print), but then converted to python when it exits the loop. That's just guessing, though. I tried it with:
import pandas as pd
from numba import jit
from numpy import nan, full
import numpy as np

@jit
def movingAverage(adj_close, maxMA):
    ma = full([len(adj_close), maxMA], nan, dtype=np.float64)
    ind = range(1, len(adj_close)+1)
    for d in ind:
        m = max(0, d-maxMA-1)
        adj = adj_close[d-1:m:-1] if (m or d == maxMA+1) else adj_close[d-1::-1]
        cs = adj.cumsum()
        for i in range(len(adj)):
            ma[d-1][i] = (cs[i] / (i+1))
        if d == ind[-1]:
            return ma  # notice that I return it after the last loop but before the loop terminates
    # return ma
and it does return:
array([[ 0. ,  nan,  nan],
       [ 1. ,  0.5,  nan],
       [ 2. ,  1.5,  1. ],
       [ 3. ,  2.5,  2. ],
       [ 4. ,  3.5,  3. ]])
This is, however, not a very efficient way because of the recalculation of len(adj_close)+1. This could be stored somewhere.
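Not part of either answer, but a general guard against this class of bug: forcing nopython mode makes numba raise an error instead of silently falling back to object mode, so mixed-mode surprises show up immediately. A sketch, assuming a reasonably recent numba where njit, np.full and cumsum are supported in nopython mode:
import numpy as np
from numba import njit  # njit is shorthand for jit(nopython=True)

@njit
def moving_average_nopython(adj_close, maxMA):
    n = len(adj_close)  # hoisted so it is computed only once
    ma = np.full((n, maxMA), np.nan)
    for d in range(1, n + 1):
        m = max(0, d - maxMA - 1)
        adj = adj_close[d-1:m:-1] if (m or d == maxMA + 1) else adj_close[d-1::-1]
        cs = adj.cumsum()
        for i in range(len(adj)):
            ma[d - 1, i] = cs[i] / (i + 1)
    return ma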

numpy pad array with nan, getting strange float instead

I'm trying to pad an array with np.nan
import numpy as np
print(np.version.version)
# 1.10.2
combine = lambda real, theo: np.vstack((theo, np.pad(real, (0, theo.shape[0] - real.shape[0]), 'constant', constant_values=np.nan)))
real = np.arange(20)
theoretical = np.linspace(0, 20, 100)
result = combine(real, theoretical)
np.any(np.isnan(result))
# False
Inspecting result, it seems instead of np.nan, the array is getting padded with -9.22337204e+18. What's going on here? How can I get np.nan?
The result of pad has the same dtype as the input, and np.nan is a float:
In [874]: np.pad(np.ones(2,dtype=int),1,mode='constant',constant_values=(np.nan,))
Out[874]: array([-2147483648, 1, 1, -2147483648])
In [875]: np.pad(np.ones(2,dtype=float),1,mode='constant',constant_values=(np.nan,))
Out[875]: array([ nan, 1., 1., nan])
The int pad is np.nan cast as an integer:
In [878]: np.array(np.nan).astype(int)
Out[878]: array(-2147483648)
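So the fix, presumably (the answer implies it rather than spelling it out), is to hand pad a float array, e.g. by casting real before padding:
import numpy as np

real = np.arange(20)  # int array, as in the question
theo = np.linspace(0, 20, 100)
padded = np.pad(real.astype(float),  # cast to float so nan survives
                (0, theo.shape[0] - real.shape[0]),
                'constant', constant_values=np.nan)
result = np.vstack((theo, padded))
print(np.any(np.isnan(result)))  # True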

Array-Based Numpy 3d Array Assignment

Take a 2D numpy.array, let's say:
mat = numpy.random.rand(3,3)
In [153]: mat
Out[153]:
array([[ 0.16716156,  0.90822617,  0.83888038],
       [ 0.89771815,  0.62627978,  0.34992542],
       [ 0.11097042,  0.80858005,  0.0437299 ]])
Changing the diagonal entries to numpy.nan is quite straightforward.
Either of the following works great:
In [154]: diag = numpy.diag_indices(mat.shape[0], ndim = 2)
In [155]: mat[diag] = numpy.nan
or
In [156]: numpy.fill_diagonal(mat, numpy.nan)
But let's say I have a 3D array, where I want the exact same process applied to every 2D slice along the first dimension.
mat = numpy.random.rand(3, 5, 5)
In [158]: mat
Out[158]:
array([[[ 0.65000325,  0.71059547,  0.31880388,  0.24818623,  0.57722849],
        [ 0.26908326,  0.41962004,  0.78642476,  0.25711662,  0.8662998 ],
        [ 0.15332566,  0.12633147,  0.54032977,  0.17322095,  0.17210078],
        [ 0.81952873,  0.20751669,  0.73514815,  0.00884358,  0.89222687],
        [ 0.62775839,  0.53657471,  0.99611842,  0.75051645,  0.59328044]],

       [[ 0.28718216,  0.84982865,  0.27830082,  0.90604492,  0.43119512],
        [ 0.43039373,  0.76557782,  0.58089787,  0.81135684,  0.39151152],
        [ 0.70592711,  0.30625204,  0.9753166 ,  0.32806864,  0.21947731],
        [ 0.74600317,  0.33711673,  0.16203076,  0.6002213 ,  0.74996638],
        [ 0.63555715,  0.71719058,  0.81420001,  0.28968442,  0.01368163]],

       [[ 0.06474027,  0.51966572,  0.006429  ,  0.98590784,  0.35708074],
        [ 0.44977222,  0.63719921,  0.88325451,  0.53820139,  0.51526687],
        [ 0.98529117,  0.46219441,  0.09349748,  0.11406291,  0.47697128],
        [ 0.77446136,  0.87423445,  0.71810465,  0.39019846,  0.94070077],
        [ 0.09154989,  0.36295161,  0.19740833,  0.17803146,  0.6498038 ]]])
A logical way to do that (I would think), is:
mat[:, diag] = numpy.nan # doesn't do it
In fact, to accomplish this, I need to:
In [190]: rng = numpy.arange(5)
In [191]: for i in numpy.arange(mat.shape[0]):
   .....:     mat[i, rng, rng] = numpy.nan
   .....:
In [192]: mat
Out[192]:
array([[[        nan,  0.4040426 ,  0.89449522,  0.63593736,  0.94922036],
        [ 0.40682651,         nan,  0.30812181,  0.01726625,  0.75655994],
        [ 0.23925763,  0.41476223,         nan,  0.91590111,  0.18391644],
        [ 0.99784977,  0.71636554,  0.21252766,         nan,  0.24195636],
        [ 0.41137357,  0.84705055,  0.60086461,  0.16403918,         nan]],

       [[        nan,  0.26183712,  0.77621913,  0.5479058 ,  0.17142263],
        [ 0.17969373,         nan,  0.89742863,  0.65698339,  0.95817106],
        [ 0.79048886,  0.16365168,         nan,  0.97394435,  0.80612441],
        [ 0.94169129,  0.10895737,  0.92614597,         nan,  0.08689534],
        [ 0.20324943,  0.91402716,  0.23112819,  0.2556875 ,         nan]],

       [[        nan,  0.43177039,  0.76901587,  0.82069345,  0.64351534],
        [ 0.14148584,         nan,  0.35820379,  0.17434688,  0.78884305],
        [ 0.85232784,  0.93526843,         nan,  0.80981366,  0.57326785],
        [ 0.82104636,  0.63453196,  0.5872653 ,         nan,  0.96214559],
        [ 0.69959383,  0.70257404,  0.92471502,  0.50077728,         nan]]])
It's for an application where speed is of the utmost importance, so if there isn't an array-based implementation of the above, I'm going to do the for-loop / assignment in Cython.
This seems to work:
diag = numpy.diag_indices(mat.shape[1], ndim = 2)
mat[:, diag[0], diag[1]] = numpy.nan
The problem is that diag is a 2-element tuple, so using it as-is in a 3D index won't work, and using *diag is unfortunately invalid syntax. However, you can also do this:
diag = (Ellipsis, *numpy.diag_indices(mat.shape[-1], ndim = 2))
mat[diag] = numpy.nan
In this case, diag is the three-element tuple you need to use as an index. Ellipsis is the object that represents : repeated as many times as necessary in the index. This version will work for any number of dimensions >2 where the last two represent the square matrices you want.
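A quick sanity check of the Ellipsis version (a minimal sketch, variable names mine):
import numpy as np

mat = np.random.rand(3, 5, 5)
diag = (Ellipsis, *np.diag_indices(mat.shape[-1], ndim=2))
mat[diag] = np.nan
print(np.isnan(mat).sum())  # 15: one 5-element diagonal per 2D slice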
Using linear indexing -
m,n,r = mat.shape
mat.reshape(m,-1)[:,np.arange(r)*(r+1)] = np.nan
Using slicing and boolean indexing -
m,n,r = mat.shape
mat.reshape(m,-1)[:,np.eye(n,r,dtype=bool).ravel()] = np.nan
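The arithmetic behind np.arange(r)*(r+1): after flattening an r x r slice row-major, element (i, i) lands at index i*r + i = i*(r+1). A one-line check (my own sketch):
import numpy as np

r = 5
rows, cols = np.diag_indices(r)
print(np.array_equal(rows * r + cols, np.arange(r) * (r + 1)))  # True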

python numpy weighted average with nans

First things first: this is not a duplicate of NumPy: calculate averages with NaNs removed, I'll explain why:
Suppose I have an array
a = array([1,2,3,4])
and I want to average over it with the weights
weights = [4,3,2,1]
output = average(a, weights=weights)
print output
2.0
OK, so this is pretty straightforward. But now I have something like this:
a = array([1,2,nan,4])
calculating the average with the usual method yields of course nan. Can I avoid this?
In principle I want to ignore the nans, so I'd like to have something like this:
a = array([1,2,4])
weights = [4,3,1]
output = average(a, weights=weights)
print output
1.75
Alternatively, you can use a MaskedArray as such:
>>> import numpy as np
>>> a = np.array([1,2,np.nan,4])
>>> weights = np.array([4,3,2,1])
>>> ma = np.ma.MaskedArray(a, mask=np.isnan(a))
>>> np.ma.average(ma, weights=weights)
1.75
First find out indices where the items are not nan, and then pass the filtered versions of a and weights to numpy.average:
>>> import numpy as np
>>> a = np.array([1,2,np.nan,4])
>>> weights = np.array([4,3,2,1])
>>> indices = np.where(np.logical_not(np.isnan(a)))[0]
>>> np.average(a[indices], weights=weights[indices])
1.75
As suggested by @mtrw in comments, it would be cleaner to use a boolean mask here instead of an index array:
>>> indices = ~np.isnan(a)
>>> np.average(a[indices], weights=weights[indices])
1.75
I would offer another solution, which is more scalable to bigger dimensions (e.g. when averaging over a different axis). The attached code works with a 2D array, which may contain nans, and takes the average over axis=0.
a = np.random.randint(5, size=(3,2)) # let's generate some random 2D array
# make weights matrix with zero weights at nan's in a
w_vec = np.arange(1, a.shape[0]+1)
w_vec = w_vec.reshape(-1, 1)
w_mtx = np.repeat(w_vec, a.shape[1], axis=1)
w_mtx *= (~np.isnan(a))
# take average as (weighted_elements_sum / weights_sum)
w_a = a * w_mtx
a_sum_vec = np.nansum(w_a, axis=0)
w_sum_vec = np.nansum(w_mtx, axis=0)
mean_vec = a_sum_vec / w_sum_vec
# mean_vec is vector with weighted nan-averages of array a taken along axis=0
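For comparison, the same axis-aware behaviour can also be had from the masked-array route shown earlier, since np.ma.average accepts an axis argument (a sketch with assumed sample data):
import numpy as np

a = np.array([[1., 2.], [np.nan, 4.], [5., np.nan]])
w = np.repeat(np.arange(1, 4).reshape(-1, 1), a.shape[1], axis=1)
print(np.ma.average(np.ma.masked_invalid(a), axis=0, weights=w))
# [4.0 3.3333...]: masked entries and their weights are dropped per column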
Expanding on @Ashwini's and @Nicolas's answers, here is a version that can also handle an edge case where all the data values are np.nan, and that is designed to also work with a pandas DataFrame without type-related issues:
from typing import List, Union
import numpy as np
import pandas as pd

def calc_wa_ignore_nan(df: pd.DataFrame, measures: List[str],
                       weights: List[Union[float, int]]) -> np.ndarray:
    """Calculates the weighted average of `measures`' values, ex-nans.

    When nans are present in `measures`' values,
    the weights are recalculated based only on the weights for non-nan measures.

    Note:
        The calculation used is NOT the same as just ignoring nans.
        For example, if we had data and weights:
            data = [2, 3, np.nan]
            weights = [0.5, 0.2, 0.3]
        calc_wa_ignore_nan approach:
            (2*(0.5/(0.5+0.2))) + (3*(0.2/(0.5+0.2))) == 2.285714285714286
        The ignoring nans approach:
            (2*0.5) + (3*0.2) == 1.6

    Args:
        df: Multiple rows of numeric data values with `measures` as column headers.
        measures: The str names of values to select from `row`.
        weights: The numeric weights associated with `measures`.

    Example:
        >>> df = pd.DataFrame({"meas1": [1, 1],
                               "meas2": [2, 2],
                               "meas3": [3, 3],
                               "meas4": [np.nan, 0],
                               "meas5": [5, 5]})
        >>> measures = ["meas2", "meas3", "meas4"]
        >>> weights = [0.5, 0.2, 0.3]
        >>> calc_wa_ignore_nan(df, measures, weights)
        array([2.28571429, 1.6])
    """
    assert not df.empty, "Nothing to calculate weighted average for: `df` is empty."
    # Need to coerce type to np.float64 instead of python's float
    # to avoid "ufunc 'isnan' not supported for the input types ..." error
    data = np.array(df[measures].values, dtype=np.float64)
    # Make a 2d array with the same weights for each row
    # (cast for safety and better errors)
    weights = np.array([weights] * data.shape[0], dtype=np.float64)
    mask = np.isnan(data)
    masked_data = np.ma.masked_array(data, mask=mask)
    masked_weights = np.ma.masked_array(weights, mask=mask)
    # np.nanmean doesn't support weights
    weighted_avgs = np.average(masked_data, weights=masked_weights, axis=1)
    # Replace masked elements with np.nan, otherwise those elements
    # will be interpreted as 0 when read into a pd.DataFrame
    weighted_avgs = weighted_avgs.filled(np.nan)
    return weighted_avgs
All the solutions above are very good, but they don't handle the case where there are nans in the weights. To do so, using pandas:
import numpy as np

def weighted_average_ignoring_nan(df, col_value, col_weight):
    den = 0
    num = 0
    for index, row in df.iterrows():
        if ~np.isnan(row[col_weight]) & ~np.isnan(row[col_value]):
            den = den + row[col_weight]
            num = num + row[col_weight] * row[col_value]
    return num / den
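A vectorized equivalent that also masks nans in the weights (my own sketch, not from the answer above):
import numpy as np

def weighted_average_ignoring_nan_vec(values, weights):
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    valid = ~np.isnan(values) & ~np.isnan(weights)  # keep positions where both are finite
    return np.average(values[valid], weights=weights[valid])

print(weighted_average_ignoring_nan_vec([1, 2, np.nan, 4], [4, np.nan, 2, 1]))
# (1*4 + 4*1) / (4 + 1) = 1.6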
Since you're looking for the mean another idea is to simply replace all the nan values with 0's:
>>> import numpy as np
>>> a = np.array([[3., 2., 5.], [np.nan, 4., np.nan], [np.nan, np.nan, np.nan]])
>>> w = np.array([[1., 2., 3.], [np.nan, np.nan, np.nan], [np.nan, np.nan, np.nan]])
>>> a[np.isnan(a)] = 0
>>> w[np.isnan(w)] = 0
>>> np.average(a, weights=w)
3.6666666666666665
This can be used with the axis functionality of the average function, but be careful that your weights don't sum up to 0.

Interpolate NaN values in a numpy array

Is there a quick way of replacing all NaN values in a numpy array with (say) the linearly interpolated values?
For example,
[1 1 1 nan nan 2 2 nan 0]
would be converted into
[1 1 1 1.3 1.6 2 2 1 0]
Let's first define a simple helper function to make it more straightforward to handle indices and logical indices of NaNs:
import numpy as np

def nan_helper(y):
    """Helper to handle indices and logical indices of NaNs.

    Input:
        - y, 1d numpy array with possible NaNs
    Output:
        - nans, logical indices of NaNs
        - index, a function, with signature indices = index(logical_indices),
          to convert logical indices of NaNs to 'equivalent' indices
    Example:
        >>> # linear interpolation of NaNs
        >>> nans, x = nan_helper(y)
        >>> y[nans] = np.interp(x(nans), x(~nans), y[~nans])
    """
    return np.isnan(y), lambda z: z.nonzero()[0]
Now nan_helper(.) can be utilized like:
>>> y = np.array([1, 1, 1, np.nan, np.nan, 2, 2, np.nan, 0])
>>>
>>> nans, x = nan_helper(y)
>>> y[nans] = np.interp(x(nans), x(~nans), y[~nans])
>>>
>>> print(y.round(2))
[ 1.    1.    1.    1.33  1.67  2.    2.    1.    0.  ]
---
Although it may at first seem a little bit overkill to specify a separate function just to do things like this:
>>> nans, x = np.isnan(y), lambda z: z.nonzero()[0]
it will eventually pay dividends.
So, whenever you are working with NaNs related data, just encapsulate all the (new NaN related) functionality needed, under some specific helper function(s). Your code base will be more coherent and readable, because it follows easily understandable idioms.
Interpolation, indeed, is a nice context to see how NaN handling is done, but similar techniques are utilized in various other contexts as well.
I came up with this code:
import numpy as np

nan = np.nan
A = np.array([1, nan, nan, 2, 2, nan, 0])
ok = ~np.isnan(A)
xp = ok.ravel().nonzero()[0]
fp = A[~np.isnan(A)]
x = np.isnan(A).ravel().nonzero()[0]
A[np.isnan(A)] = np.interp(x, xp, fp)
print(A)
It prints
[ 1. 1.33333333 1.66666667 2. 2. 1. 0. ]
Just use numpy logical indexing and the where statement to apply a 1D interpolation.
import numpy as np
from scipy import interpolate

def fill_nan(A):
    '''
    interpolate to fill nan values
    '''
    inds = np.arange(A.shape[0])
    good = np.where(np.isfinite(A))
    f = interpolate.interp1d(inds[good], A[good], bounds_error=False)
    B = np.where(np.isfinite(A), A, f(inds))
    return B
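A quick usage sketch (note that with bounds_error=False and no fill_value, any leading or trailing nans would stay nan):
A = np.array([1., np.nan, np.nan, 2., 2., np.nan, 0.])
print(fill_nan(A))
# [1.         1.33333333 1.66666667 2.         2.         1.         0.        ]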
For two dimensional data, the SciPy's griddata works fairly well for me:
>>> import numpy as np
>>> from scipy.interpolate import griddata
>>>
>>> # SETUP
>>> a = np.arange(25).reshape((5, 5)).astype(float)
>>> a
array([[  0.,   1.,   2.,   3.,   4.],
       [  5.,   6.,   7.,   8.,   9.],
       [ 10.,  11.,  12.,  13.,  14.],
       [ 15.,  16.,  17.,  18.,  19.],
       [ 20.,  21.,  22.,  23.,  24.]])
>>> a[np.random.randint(2, size=(5, 5)).astype(bool)] = np.nan
>>> a
array([[ nan,  nan,  nan,   3.,   4.],
       [ nan,   6.,   7.,  nan,  nan],
       [ 10.,  nan,  nan,  13.,  nan],
       [ 15.,  16.,  17.,  nan,  19.],
       [ nan,  nan,  22.,  23.,  nan]])
>>>
>>> # THE INTERPOLATION
>>> x, y = np.indices(a.shape)
>>> interp = np.array(a)
>>> interp[np.isnan(interp)] = griddata(
...     (x[~np.isnan(a)], y[~np.isnan(a)]),  # points we know
...     a[~np.isnan(a)],                     # values we know
...     (x[np.isnan(a)], y[np.isnan(a)]))    # points to interpolate
>>> interp
array([[ nan,  nan,  nan,   3.,   4.],
       [ nan,   6.,   7.,   8.,   9.],
       [ 10.,  11.,  12.,  13.,  14.],
       [ 15.,  16.,  17.,  18.,  19.],
       [ nan,  nan,  22.,  23.,  nan]])
I am using it on 3D images, operating on 2D slices (4000 slices of 350x350). The whole operation still takes about an hour :/
Or, building on Winston's answer:
import numpy as np

def pad(data):
    bad_indexes = np.isnan(data)
    good_indexes = np.logical_not(bad_indexes)
    good_data = data[good_indexes]
    interpolated = np.interp(bad_indexes.nonzero()[0], good_indexes.nonzero()[0], good_data)
    data[bad_indexes] = interpolated
    return data

A = np.array([[1, 20, 300],
              [np.nan, np.nan, np.nan],
              [3, 40, 500]])
A = np.apply_along_axis(pad, 0, A)
print(A)
Result:
[[   1.   20.  300.]
 [   2.   30.  400.]
 [   3.   40.  500.]]
It might be easier to change how the data is being generated in the first place, but if not:
bad_indexes = np.isnan(data)
Create a boolean array indicating where the nans are.
good_indexes = np.logical_not(bad_indexes)
Create a boolean array indicating where the good values are.
good_data = data[good_indexes]
A restricted version of the original data excluding the nans.
interpolated = np.interp(bad_indexes.nonzero()[0], good_indexes.nonzero()[0], good_data)
Run all the bad indexes through interpolation.
data[bad_indexes] = interpolated
Replace the original data with the interpolated values.
I use the interpolation for replacing all NaN values:
import numpy as np

A = np.array([1, np.nan, np.nan, 2, 2, np.nan, 0])
np.interp(np.arange(len(A)),
          np.arange(len(A))[~np.isnan(A)],
          A[~np.isnan(A)])
Output:
array([1.        , 1.33333333, 1.66666667, 2.        , 2.        , 1.        , 0.        ])
I needed an approach that would also fill in NaNs at the start or end of the data, which the main answer does not appear to do.
The function I came up with uses a linear regression to fill in the NaN's. This overcomes my problem:
import numpy as np

def linearly_interpolate_nans(y):
    # Fit a linear regression to the non-nan y values

    # Create X matrix for linreg with an intercept and an index
    X = np.vstack((np.ones(len(y)), np.arange(len(y))))

    # Get the non-NaN values of X and y
    X_fit = X[:, ~np.isnan(y)]
    y_fit = y[~np.isnan(y)].reshape(-1, 1)

    # Estimate the coefficients of the linear regression
    beta = np.linalg.lstsq(X_fit.T, y_fit, rcond=None)[0]

    # Fill in all the nan values using the predicted coefficients
    y.flat[np.isnan(y)] = np.dot(X[:, np.isnan(y)].T, beta)
    return y
Here's an example usage case:
# Make an array according to some linear function
y = np.arange(12) * 1.5 + 10.
# First and last value are NaN
y[0] = np.nan
y[-1] = np.nan
# 30% of other values are NaN
for i in range(len(y)):
if np.random.rand() > 0.7:
y[i] = np.nan
# NaN's are filled in!
print (y)
print (linearly_interpolate_nans(y))
Slightly optimized version based on the response of Bryan Woods. It handles starting and ending values of the source data correctly, and it is 25-30% faster than the original version. You may also use different kinds of interpolation (see the scipy.interpolate.interp1d documentation for details).
import numpy as np
from scipy.interpolate import interp1d

def fill_nans_scipy1(padata, pkind='linear'):
    """
    Interpolates data to fill nan values

    Parameters:
        padata : nd array
            source data with np.nan values
    Returns:
        nd array
            resulting data with interpolated values instead of nans
    """
    aindexes = np.arange(padata.shape[0])
    agood_indexes, = np.where(np.isfinite(padata))
    f = interp1d(agood_indexes,
                 padata[agood_indexes],
                 bounds_error=False,
                 copy=False,
                 fill_value="extrapolate",
                 kind=pkind)
    return f(aindexes)
In [17]: adata = np.array([1, 2, np.nan, 4])
In [18]: adata
Out[18]: array([ 1.,  2., nan,  4.])
In [19]: fill_nans_scipy1(adata)
Out[19]: array([1., 2., 3., 4.])
Building on the answer by Bryan Woods, I modified his code to also convert lists consisting only of NaN to a list of zeros:
import numpy as np
from scipy.interpolate import interp1d

def fill_nan(A):
    '''
    interpolate to fill nan values
    '''
    inds = np.arange(A.shape[0])
    good = np.where(np.isfinite(A))
    if len(good[0]) == 0:
        return np.nan_to_num(A)
    f = interp1d(inds[good], A[good], bounds_error=False)
    B = np.where(np.isfinite(A), A, f(inds))
    return B
Simple addition, I hope it will be of use to someone.
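For instance (a quick sketch of the all-NaN edge case this modification covers):
print(fill_nan(np.array([np.nan, np.nan, np.nan])))
# [0. 0. 0.]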
Interpolation and extrapolation with padding keywords
The following solution interpolates the nan values in an array by np.interp, if a finite value is present on both sides. Nan values at the borders are handled by np.pad with modes like constant or reflect.
import numpy as np
import matplotlib.pyplot as plt

def extrainterpolate_nans_1d(
        arr, kws_pad=({'mode': 'edge'}, {'mode': 'edge'})):
    """Interpolates and extrapolates nan values.

    Interpolation is linear, compare np.interp(..).
    Extrapolation works with pad keywords, compare np.pad(..).

    Parameters
    ----------
    arr : np.ndarray, shape (N,)
        Array to replace nans in.
    kws_pad : dict or (dict, dict)
        kwargs for np.pad on left and right side

    Returns
    -------
    np.ndarray
        Array with nan values replaced.

    See Also
    --------
    https://numpy.org/doc/stable/reference/generated/numpy.interp.html
    https://numpy.org/doc/stable/reference/generated/numpy.pad.html
    https://stackoverflow.com/a/43821453/7128154
    """
    assert arr.ndim == 1
    if isinstance(kws_pad, dict):
        kws_pad_left = kws_pad
        kws_pad_right = kws_pad
    else:
        assert len(kws_pad) == 2
        assert isinstance(kws_pad[0], dict)
        assert isinstance(kws_pad[1], dict)
        kws_pad_left = kws_pad[0]
        kws_pad_right = kws_pad[1]

    arr_ip = arr.copy()
    # interpolation
    inds = np.arange(len(arr_ip))
    nan_msk = np.isnan(arr_ip)
    arr_ip[nan_msk] = np.interp(inds[nan_msk], inds[~nan_msk], arr[~nan_msk])
    # determine pad range
    i0 = next(
        (ids for ids, val in np.ndenumerate(arr) if not np.isnan(val)), 0)[0]
    i1 = next(
        (ids for ids, val in np.ndenumerate(arr[::-1]) if not np.isnan(val)), 0)[0]
    i1 = len(arr) - i1
    # print('pad in range [0:{:}] and [{:}:{:}]'.format(i0, i1, len(arr)))
    # pad
    arr_pad = np.pad(
        arr_ip[i0:], pad_width=[(i0, 0)], **kws_pad_left)
    arr_pad = np.pad(
        arr_pad[:i1], pad_width=[(0, len(arr) - i1)], **kws_pad_right)
    return arr_pad
# setup data
ys = np.arange(30, dtype=float)**2 / 20
ys[:5] = np.nan
ys[20:] = 20
ys[28:] = np.nan
ys[[7, 13, 14, 18, 22]] = np.nan

ys_ie0 = extrainterpolate_nans_1d(ys)
kws_pad_sym = {'mode': 'symmetric'}
kws_pad_const7 = {'mode': 'constant', 'constant_values': 7.}
ys_ie1 = extrainterpolate_nans_1d(ys, kws_pad=(kws_pad_sym, kws_pad_const7))
ys_ie2 = extrainterpolate_nans_1d(ys, kws_pad=(kws_pad_const7, kws_pad_sym))

fig, ax = plt.subplots()
ax.scatter(np.arange(len(ys)), ys, s=15**2, label='ys')
ax.scatter(np.arange(len(ys)), ys_ie0, s=8**2, label='ys_ie0, left_pad edge, right_pad edge')
ax.scatter(np.arange(len(ys)), ys_ie1, s=6**2, label='ys_ie1, left_pad symmetric, right_pad 7')
ax.scatter(np.arange(len(ys)), ys_ie2, s=4**2, label='ys_ie2, left_pad 7, right_pad symmetric')
ax.legend()
As suggested by an earlier comment, the best way to do this is to use a peer reviewed implementation. The pandas library has an interpolation method for 1d data, which interpolates np.nan values in Series or DataFrame:
pandas.Series.interpolate or pandas.DataFrame.interpolate
The documentation is very concise; I recommend reading through it! My implementation:
import pandas as pd

magnitudes_series = pd.Series(magnitudes)  # Convert np.array to pd.Series
magnitudes_series.interpolate(
    # I used "akima" because the second derivative of my data has frequent drops to 0
    method=interpolation_method,
    # Interpolate from both sides of the sequence, up to you (made sense for my data)
    limit_direction="both",
    # Interpolate only np.nan sequences that have number sequences at the ends of the respective np.nan sequences
    limit_area="inside",
    inplace=True,
)
# I chose to remove np.nan at the tails of data sequence
magnitudes_series.dropna(inplace=True)
result_in_numpy_array = magnitudes_series.values
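A self-contained version of the same idea (a sketch; magnitudes and interpolation_method above are the author's own variables, so here I assume a small array and plain linear interpolation):
import numpy as np
import pandas as pd

magnitudes = np.array([np.nan, 1., np.nan, 3., 4., np.nan])
series = pd.Series(magnitudes)
series.interpolate(method="linear", limit_direction="both", limit_area="inside", inplace=True)
series.dropna(inplace=True)  # drop the np.nan left at the tails
print(series.values)  # [1. 2. 3. 4.]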
Importing scipy looks like overkill to me. Here's a simple way using only numpy, maintaining the same conventions as np.interp:
import numpy as np

def interp_nans(x, left=None, right=None, period=None):
    """
    e.g. [1 1 1 nan nan 2 2 nan 0] -> [1 1 1 1.3 1.6 2 2 1 0]
    """
    xp = [i for i, yi in enumerate(x) if np.isfinite(yi)]
    fp = [yi for i, yi in enumerate(x) if np.isfinite(yi)]
    return list(np.interp(x=list(range(len(x))), xp=xp, fp=fp, left=left, right=right, period=period))
