Pandas index interpolation filling in missing values after the last data point - python

I have a data frame with missing values at the end of a column, e.g.:
df = pd.DataFrame({'a':[np.nan,1,2,np.nan,np.nan,5,np.nan,np.nan]}, index=[0,1,2,3,4,5,6,7])
a
0 NaN
1 1.0
2 2.0
3 NaN
4 NaN
5 5.0
6 NaN
7 NaN
Using the 'index' interpolation method:
df.interpolate(method='index')
returns the data frame with the last missing values forward-filled:
a
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 5.0
7 5.0
Is there a way to turn off that behaviour and leave the last missing values as NaN:
a
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 NaN
7 NaN

I think you need the new parameter limit_area, added in pandas 0.23.0. With limit_area='inside', only NaNs that are surrounded by valid values get interpolated, which is exactly the desired output:
df = df.interpolate(method='index', limit_area='inside')
print (df)
a
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 NaN
7 NaN
(The limit and limit_direction parameters also exist, but they cap how many NaNs are filled rather than distinguishing interior gaps from trailing ones, so limit_area is the right tool here.)
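If you want both behaviours, limit combines with limit_area. A quick sketch on the question's original df (my example, not from the answer above): with limit=1 only the first NaN of the interior gap is filled, while rows 0, 6 and 7 stay NaN because they lie outside the valid values:
print(df.interpolate(method='index', limit=1, limit_area='inside'))
#      a
# 0  NaN
# 1  1.0
# 2  2.0
# 3  3.0
# 4  NaN
# 5  5.0
# 6  NaN
# 7  NaN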

Do you mean that the last NaNs (one or more) should remain?
How about this: find the position of the last valid value, interpolate everything up to and including it, and concatenate the untouched tail back on.
# position of the last non-NaN value in the frame
valargmax = np.max(np.where(df.notna().values.flatten())[0])
# interpolate only up to that position; leave the rest as-is
r = pd.concat([df.iloc[:valargmax + 1].interpolate(method='index'),
               df.iloc[valargmax + 1:]])
print(r)
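A shorter variant of the same idea (a sketch, assuming a single column 'a' as in the question) uses Series.last_valid_index:
last = df['a'].last_valid_index()                  # label of the last non-NaN value (5 here)
head = df.loc[:last].interpolate(method='index')   # fill up to and including it
tail = df.loc[last:].iloc[1:]                      # trailing rows stay NaN
r = pd.concat([head, tail])
print(r)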

Related

Backfill column values using real value divided by number of preceding NA values in Pandas

test_df = pd.DataFrame({'a':[np.nan,np.nan,np.nan,4,np.nan,np.nan,6]})
test_df
a
0 NaN
1 NaN
2 NaN
3 4.0
4 NaN
5 NaN
6 6.0
I'm trying to backfill each real value divided by the number of preceding NaN values plus one (itself). The following is what I'm trying to get:
a
0 1.0
1 1.0
2 1.0
3 1.0
4 2.0
5 2.0
6 2.0
Try:
# identify the blocks: cumsum on the reversed non-NaN mask gives every
# run of NaNs the same id as the valid value that ends it
groups = test_df['a'].notna()[::-1].cumsum()
# fill NaNs with 0, then groupby and transform: each group holds one valid
# value v and k zeros, so the group mean is v / (k + 1)
test_df['a'] = test_df['a'].fillna(0).groupby(groups).transform('mean')
Output:
a
0 1.0
1 1.0
2 1.0
3 1.0
4 2.0
5 2.0
6 2.0
IIUC use:
# get reverse group
group = test_df.loc[::-1, 'a'].notna().cumsum()
# get size and divide
test_df['a'] = (test_df['a']
                .bfill()
                .div(test_df.groupby(group)['a'].transform('size'))
                )
Or with rdiv:
test_df['a'] = (test_df
                .groupby(group)['a']
                .transform('size')
                .rdiv(test_df['a'].bfill())
                )
Output (as new column for clarity):
a a2
0 NaN 1.0
1 NaN 1.0
2 NaN 1.0
3 4.0 1.0
4 NaN 2.0
5 NaN 2.0
6 6.0 2.0
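Both answers rely on the same reversed-cumsum trick; printing the intermediate series makes the grouping visible (a quick check on the question's test_df):
group = test_df.loc[::-1, 'a'].notna().cumsum().sort_index()
print(group)
# 0    2
# 1    2
# 2    2
# 3    2
# 4    1
# 5    1
# 6    1
# Name: a, dtype: int64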

Is there a way to forward fill with ascending logic in pandas / numpy?

What is the most pandastic way to forward fill with ascending logic (without iterating over the rows)?
input:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['test'] = np.nan,np.nan,1,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,6,np.nan,np.nan
df['desired_output'] = np.nan,np.nan,1,1,1,3,3,3,3,3,6,6,6
print (df)
output:
test desired_output
0 NaN NaN
1 NaN NaN
2 1.0 1.0
3 NaN 1.0
4 NaN 1.0
5 3.0 3.0
6 NaN 3.0
7 NaN 3.0
8 2.0 3.0
9 NaN 3.0
10 6.0 6.0
11 NaN 6.0
12 NaN 6.0
In the 'test' column, the number of consecutive NaNs is random.
In the 'desired_output' column, I'm trying to forward fill with ascending values only. Also, when a lower value is encountered (row 8, value = 2.0 above), it is overwritten with the current higher value.
Can anyone help? Thanks in advance.
You can combine cummax, which takes the running maximum while leaving the NaNs in place, with ffill to replace them:
df['desired_output'] = df['test'].cummax().ffill()
output:
test desired_output
0 NaN NaN
1 NaN NaN
2 1.0 1.0
3 NaN 1.0
4 NaN 1.0
5 3.0 3.0
6 NaN 3.0
7 NaN 3.0
8 2.0 3.0
9 NaN 3.0
10 6.0 6.0
11 NaN 6.0
12 NaN 6.0
intermediate Series:
df['test'].cummax()
0 NaN
1 NaN
2 1.0
3 NaN
4 NaN
5 3.0
6 NaN
7 NaN
8 3.0
9 NaN
10 6.0
11 NaN
12 NaN
Name: test, dtype: float64

Fill nan gaps in pandas df only if gaps smaller than N nans

I am working with a pandas data frame that also contains nan values. I want to substitute the nans with interpolated values using df.interpolate, but only if the sequence of nan values is at most N long. As an example, let's assume I choose N = 2 (so I want to fill sequences of up to 2 nans) and I have a dataframe with
print(df)
A B C
1 1 1
nan nan 2
nan nan 3
nan 4 nan
5 5 5
In such a case I want to apply a function to df so that only nan sequences of length <= 2 get filled, while longer sequences are left untouched, resulting in my desired output of
print(df)
A B C
1 1 1
nan 2 2
nan 3 3
nan 4 4
5 5 5
Note that I am aware of the option limit=N inside df.interpolate, but it doesn't do what I want, because it fills nan sequences of any length, merely limiting the filling to the first N nans of each gap, resulting in the undesired output
print(df)
A B C
1 1 1
2 2 2
3 3 3
nan 4 4
5 5 5
So do you know of a function, or how to construct code, that produces my desired output? Thanks.
You can perform run-length encoding and identify, for each column, the runs of NaN that are at most two elements long. One way to do that is to use get_id from the package pdrle (disclaimer: I wrote it).
import pdrle
# mark NaN cells that belong to runs of length <= 2
chk = df.isna() & (df.apply(lambda x: x.groupby(pdrle.get_id(x)).transform(len)) <= 2)
df[chk] = df.interpolate()[chk]
# A B C
# 0 1.0 1.0 1.0
# 1 NaN 2.0 2.0
# 2 NaN 3.0 3.0
# 3 NaN 4.0 4.0
# 4 5.0 5.0 5.0
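If you would rather not add a dependency, the run id can be built with plain pandas (a sketch of the same idea; it groups by runs of NaN vs. non-NaN, which is all that matters here because the mask keeps only NaN positions):
def get_id(x):
    # a new run starts whenever the NaN-ness of the value changes
    return x.isna().ne(x.isna().shift()).cumsum()

chk = df.isna() & (df.apply(lambda x: x.groupby(get_id(x)).transform(len)) <= 2)
df[chk] = df.interpolate()[chk]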
Try:
N = 2
df_interpolated = df.interpolate()
for c in df:
    mask = df[c].isna()
    # flag NaNs that belong to runs longer than N
    x = (
        mask.groupby((mask != mask.shift()).cumsum()).transform(
            lambda x: len(x) > N
        )
        * mask
    )
    # selecting with ~x drops the flagged rows; on assignment they realign as NaN
    df_interpolated[c] = df_interpolated.loc[~x, c]
print(df_interpolated)
Prints:
A B C
0 1.0 1.0 1.0
1 NaN 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 5.0 5.0 5.0
Trying with a different df:
A B C
0 1.0 1.0 1.0
1 NaN NaN 2.0
2 NaN NaN 3.0
3 NaN 4.0 NaN
4 5.0 5.0 5.0
5 NaN 5.0 NaN
6 NaN 5.0 NaN
7 8.0 5.0 NaN
produces:
A B C
0 1.0 1.0 1.0
1 NaN 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 5.0 5.0 5.0
5 6.0 5.0 NaN
6 7.0 5.0 NaN
7 8.0 5.0 NaN
You can try the following:
n = 2
cols = df.columns[df.isna().sum() <= n]
df[cols] = df[cols].interpolate()
df
A B C
0 1.0 1.0 1.0
1 NaN 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 5.0 5.0 5.0
df.columns[df.isna().sum() <= n] filters the columns based on your condition; then you simply overwrite those columns after interpolation. Note that this checks the total NaN count per column rather than the length of each gap, which coincides here only because every column has a single gap; see the sketch below.
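A quick caveat example (mine, not from the original answer): a column with two separate short gaps is skipped entirely, because its total NaN count exceeds n even though each individual gap qualifies:
s = pd.DataFrame({'A': [1, np.nan, np.nan, 4, np.nan, np.nan, 7]})
# each gap is only 2 long, but the column holds 4 NaNs in total,
# so the column-level filter (<= 2) selects nothing
print(list(s.columns[s.isna().sum() <= 2]))   # []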

How to move values over in each Pandas data frame row where np.nan are located?

If I have a pandas data frame like this:
A B C D E F G H
0 0 2 3 5 NaN NaN NaN NaN
1 2 7 9 1 2 NaN NaN NaN
2 1 5 7 2 1 2 1 NaN
3 6 1 3 2 1 1 5 5
4 1 2 3 6 NaN NaN NaN NaN
How do I move all of the numerical values to the end of each row and place the NaNs before them, so that I get a pandas data frame like this:
A B C D E F G H
0 NaN NaN NaN NaN 0 2 3 5
1 NaN NaN NaN 2 7 9 1 2
2 NaN 1 5 7 2 1 2 1
3 6 1 3 2 1 1 5 5
4 NaN NaN NaN NaN 1 2 3 6
One-line solution:
df.apply(lambda x: pd.concat([x[x.isna()], x[x.notna()]], ignore_index=True), axis=1)
(Note that ignore_index=True resets the column labels to integers; reassign df.columns afterwards if you need the original names.)
I guess the best approach is to work row by row: write a function that does the job, then use apply or transform to run it on each row.
def movenan(x):
    fl = len(x)                # full row length
    nl = len(x.dropna())       # number of non-NaN values
    nanarr = np.empty(fl - nl)
    nanarr[:] = np.nan
    return pd.concat([pd.Series(nanarr), x.dropna()], ignore_index=True)

ddf = df.transform(movenan, axis=1)
ddf.columns = df.columns
Using your sample data, the resulting ddf is:
A B C D E F G H
0 NaN NaN NaN NaN 0.0 2.0 3.0 5.0
1 NaN NaN NaN 2.0 7.0 9.0 1.0 2.0
2 NaN 1.0 5.0 7.0 2.0 1.0 2.0 1.0
3 6.0 1.0 3.0 2.0 1.0 1.0 5.0 5.0
4 NaN NaN NaN NaN 1.0 2.0 3.0 6.0
The movenan function creates an array of NaNs of the required length, drops the NaNs from the row, and concatenates the two resulting Series.
ignore_index=True is required because you don't want to preserve the values' positions in their columns (they move to different columns), but in doing so the column names are lost and replaced by integers. The last line simply copies the original column names back into the new dataframe.
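For larger frames, a vectorized alternative (my sketch, not part of the original answer) avoids the per-row Python overhead: a stable argsort on notna() sends the NaNs (False) to the front of each row while preserving the order of the remaining values:
# stable sort keys: False (NaN) sorts before True (value); original order kept within each
order = np.argsort(df.notna().to_numpy(), axis=1, kind='stable')
shifted = np.take_along_axis(df.to_numpy(dtype=float), order, axis=1)
ddf = pd.DataFrame(shifted, index=df.index, columns=df.columns)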

I want to subtract each column from the previous non-null column using the diff function

I have a long list of columns and I want to subtract the previous column from the current column and replace the current column with the difference.
So if I have:
A B C D
1 NaN 3 7
3 NaN 8 10
2 NaN 6 11
I want the output to be:
A B C D
1 NaN 2 4
3 NaN 5 2
2 NaN 4 5
I have been trying to use this code:
df2 = df1.diff(axis=1)
but this does not produce the desired output
Thanks in advance.
You can do this with df.where and then update to bring back the first non-null entry for each row of your DataFrame.
Sample Data: df
A B C D
0 1.0 NaN 3.0 7.0
1 1.0 4.0 5.0 9.0
2 NaN 4.0 NaN 4.0
3 NaN 4.0 NaN NaN
4 NaN NaN 3.0 7.0
5 3.0 NaN NaN 7.0
6 6.0 NaN NaN NaN
Code:
# where df is NaN, keep NaN; elsewhere take the row-wise diff of the
# forward-filled values, i.e. diff against the previous non-null column
df_d = df.where(df.isnull(),
                df.ffill(axis=1).diff(axis=1))
# bring back the first non-null entry of each row (its diff is meaningless)
df_d.update(df.where(df.notnull().cumsum(1).cumsum(1) == 1))
Output: df_d
A B C D
0 1.0 NaN 2.0 4.0
1 1.0 3.0 1.0 4.0
2 NaN 4.0 NaN 0.0
3 NaN 4.0 NaN NaN
4 NaN NaN 3.0 4.0
5 3.0 NaN NaN 4.0
6 6.0 NaN NaN NaN
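To see what the update step restores: df.notnull().cumsum(1).cumsum(1) == 1 is True exactly at the first non-null entry of each row (a quick check on the sample df):
print(df.notnull().cumsum(axis=1).cumsum(axis=1) == 1)
#        A      B      C      D
# 0   True  False  False  False
# 1   True  False  False  False
# 2  False   True  False  False
# 3  False   True  False  False
# 4  False  False   True  False
# 5   True  False  False  False
# 6   True  False  False  False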
Actually, diff is working as expected: anything minus NaN is NaN, so with column B entirely NaN, both B and C come out as NaN. If you want each column diffed against the previous non-null column, drop the all-NaN columns before diffing, then restore the first column from the original dataframe:
df2 = df1.dropna(axis=1, how='all').diff(axis=1)
df2['A'] = df1['A']
df2 = df2.reindex(columns=df1.columns)
print(df2)
Output
   A   B    C    D
0  1 NaN  2.0  4.0
1  3 NaN  5.0  2.0
2  2 NaN  4.0  5.0
