How to back fillna only one na value with a specific value - python

I want to backfill NaNs, but I only want to backfill one NaN value per run of NaNs and replace that value with a specific value (1)
I tried using
df.fillna(value=1,method='bfill',inplace=True,limit=1)
but I get
ValueError: Cannot specify both 'value' and 'method'.
because I cannot use method and value at the same time. If this were possible, I would not be asking this question (pandas could perhaps support this in a future update)
Here is an example:
import pandas as pd
import numpy as np
col1 = [3,2,2,np.nan,np.nan,np.nan,2,6,np.nan,np.nan,np.nan,6]
col2 = [8,2,np.nan,np.nan,6,0,np.nan,5,np.nan,6,6,3]
col3 = [np.nan,np.nan,np.nan,np.nan,6,7,np.nan,1,np.nan,np.nan,3,4]
df = pd.DataFrame(data={'col1': col1, 'col2': col2, 'col3': col3})
print(df)
index col1 col2 col3
0     3    8    NaN
1     2    2    NaN
2     2    NaN  NaN
3     NaN  NaN  NaN
4     NaN  6    6
5     NaN  0    7
6     2    NaN  NaN
7     6    5    1
8     NaN  NaN  NaN
9     NaN  6    NaN
10    NaN  6    3
11    6    3    4
here is my desired output:
index col1 col2 col3
0     3    8    NaN
1     2    2    NaN
2     2    NaN  NaN
3     NaN  1    1
4     NaN  6    6
5     1    0    7
6     2    1    1
7     6    5    1
8     NaN  1    NaN
9     NaN  6    1
10    1    6    3
11    6    3    4
I've been at this for hours. Anything is appreciated!

You can bfill with limit=1; it doesn't matter which values get pulled in. Then you check which cells are filled in the result but still NaN in your original dataframe. Those positions you fill with 1:
d = df.bfill(limit=1)                  # backfill one step; the values pulled in don't matter
mask = df.isna() & d.notna()           # cells that were NaN but got filled by the bfill
df = pd.DataFrame(np.where(mask, 1, df), columns=df.columns)
Output
col1 col2 col3
0 3.0 8.0 NaN
1 2.0 2.0 NaN
2 2.0 NaN NaN
3 NaN 1.0 1.0
4 NaN 6.0 6.0
5 1.0 0.0 7.0
6 2.0 1.0 1.0
7 6.0 5.0 1.0
8 NaN 1.0 NaN
9 NaN 6.0 1.0
10 1.0 6.0 3.0
11 6.0 3.0 4.0
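A small variant on the same idea (a sketch, not part of the original answer): DataFrame.mask writes the replacement in place, keeping the original index and leaving dtype handling to pandas:
d = df.bfill(limit=1)
mask = df.isna() & d.notna()  # the one backfilled cell in each gap
df = df.mask(mask, 1)         # write 1 into exactly those cells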

Apparently fillna cannot handle specifying both value and method. Here's an alternative approach that builds the mask directly: a cell should become 1 exactly when it is NaN and the cell directly below it is not.
m = df.isna() & df.shift(-1).notna()
pd.DataFrame(np.where(m, 1, df), columns=df.columns)
col1 col2 col3
0 3.0 8.0 NaN
1 2.0 2.0 NaN
2 2.0 NaN NaN
3 NaN 1.0 1.0
4 NaN 6.0 6.0
5 1.0 0.0 7.0
6 2.0 1.0 1.0
7 6.0 5.0 1.0
8 NaN 1.0 NaN
9 NaN 6.0 1.0
10 1.0 6.0 3.0
11 6.0 3.0 4.0

Related

Fill nan gaps in pandas df only if gaps smaller than N nans

I am working with a pandas data frame that also contains NaN values. I want to substitute the NaNs with interpolated values via df.interpolate, but only if the sequence of NaN values has length <= N. As an example, let's assume that I choose N = 2 (so I want to fill gaps of up to 2 NaNs) and I have a dataframe with
print(df)
A B C
1 1 1
nan nan 2
nan nan 3
nan 4 nan
5 5 5
In such a case I want to apply a function to df so that only the NaN sequences of length <= N get filled, while the longer sequences remain untouched, resulting in my desired output of
print(df)
A B C
1 1 1
nan 2 2
nan 3 3
nan 4 4
5 5 5
Note that I am aware of the limit=N option inside df.interpolate, but it doesn't do what I want: it would start filling a NaN sequence of any length, just capping the filling at the first N NaNs, resulting in the undesired output
print(df)
A B C
1 1 1
2 2 2
3 3 3
nan 4 4
5 5 5
So do you know of a function, or how to construct code, that produces my desired output? Thanks!
You can perform run-length encoding and identify the runs of NaN that are shorter than or equal to two elements for each column. One way to do that is to use get_id from the package pdrle (disclaimer: I wrote it).
import pdrle

# run length of each NaN/non-NaN stretch per column; keep only the NaN runs of length <= 2
chk = df.isna() & (df.apply(lambda x: x.groupby(pdrle.get_id(x)).transform(len)) <= 2)
df[chk] = df.interpolate()[chk]
# A B C
# 0 1.0 1.0 1.0
# 1 NaN 2.0 2.0
# 2 NaN 3.0 3.0
# 3 NaN 4.0 4.0
# 4 5.0 5.0 5.0
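If you would rather not add a dependency, the run ids that pdrle.get_id produces can be rebuilt with plain pandas. A sketch, with get_run_id as a hypothetical stand-in for pdrle.get_id:
import pandas as pd

def get_run_id(x: pd.Series) -> pd.Series:
    # a new run starts whenever the NaN-ness changes from the previous row
    is_na = x.isna()
    return (is_na != is_na.shift()).cumsum()

N = 2
run_len = df.apply(lambda x: x.groupby(get_run_id(x)).transform("size"))
chk = df.isna() & (run_len <= N)
df[chk] = df.interpolate()[chk]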
Try:
N = 2
df_interpolated = df.interpolate()
for c in df:
    mask = df[c].isna()
    x = (
        mask.groupby((mask != mask.shift()).cumsum()).transform(
            lambda x: len(x) > N
        )
        * mask
    )
    df_interpolated[c] = df_interpolated.loc[~x, c]
print(df_interpolated)
Prints:
A B C
0 1.0 1.0 1.0
1 NaN 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 5.0 5.0 5.0
Trying with a different df:
A B C
0 1.0 1.0 1.0
1 NaN NaN 2.0
2 NaN NaN 3.0
3 NaN 4.0 NaN
4 5.0 5.0 5.0
5 NaN 5.0 NaN
6 NaN 5.0 NaN
7 8.0 5.0 NaN
produces:
A B C
0 1.0 1.0 1.0
1 NaN 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 5.0 5.0 5.0
5 6.0 5.0 NaN
6 7.0 5.0 NaN
7 8.0 5.0 NaN
You can try the following:
n=2
cols = df.columns[df.isna().sum()<=n]
df[cols] = df[cols].interpolate()
df
A B C
0 1.0 1.0 1.0
1 NaN 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 5.0 5.0 5.0
df.columns[df.isna().sum()<=n] selects the columns whose total NaN count is at most n. Then, you simply overwrite those columns with their interpolated values. Note that this conditions on the total number of NaNs per column rather than on the length of each gap, which matches the requirement here only because each column contains a single gap.

Find cumulative sums of each grouping in a row and then set the grouping equal to the maximum sum

If I have a pandas data frame of ones like this:
NaN 1 1 1 1 NaN 1 1 1 NaN 1
NaN NaN 1 1 1 1 NaN NaN 1 NaN 1
NaN NaN 1 1 1 1 1 1 1 1 1
How do I do a cumulative sum in each row, but then set each grouping to the maximum value of its cumulative sum, so that I get a pandas data frame like this:
NaN 4 4 4 4 NaN 3 3 3 NaN 1
NaN NaN 4 4 4 4 NaN NaN 1 NaN 1
NaN NaN 9 9 9 9 9 9 9 9 9
First we stack with isnull, then create the sub-groups with cumsum and count the consecutive 1s with transform; the last step is just unstack to convert the data back:
s = df.isnull().stack()
s = s.groupby(level=0).cumsum()[~s]    # run id per row, keeping only the non-null cells
s = s.groupby([s.index.get_level_values(0), s]).transform('count').unstack().reindex_like(df)
0 1 2 3 4 5 6 7 8 9 10
0 NaN 4.0 4.0 4.0 4.0 NaN 3.0 3.0 3.0 NaN 1.0
1 NaN NaN 4.0 4.0 4.0 4.0 NaN NaN 1.0 NaN 1.0
2 NaN NaN 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0
Many more steps than @YOBEN_S's answer, but we can make use of melt and groupby; we use cumcount to create a conditional helper column to group with.
from io import StringIO

import numpy as np
import pandas as pd

d = """NaN 1 1 1 1 NaN 1 1 1 NaN 1
NaN NaN 1 1 1 1 NaN NaN 1 NaN 1
NaN NaN 1 1 1 1 1 1 1 1 1"""
df = pd.read_csv(StringIO(d), header=None, sep=r"\s+")

s = df.reset_index().melt(id_vars="index")
# give each NaN in a row an ordinal; ffill spreads it so the cells after each NaN share a group
s.loc[s["value"].isnull(), "counter"] = s.groupby(
    [s["index"], s["value"].isnull()]
).cumcount()
s["counter"] = s.groupby(["index"])["counter"].ffill()
s["val"] = s.groupby(["index", "counter"])["value"].cumsum()
s["val"] = s.groupby(["counter", "index"])["val"].transform("max")
s.loc[s["value"].isnull(), "val"] = np.nan
df2 = (
    s.groupby(["index", "variable"])["val"]
    .first()
    .unstack()
    .rename_axis(None, axis=1)
    .rename_axis(None)
)
print(df2)
0 1 2 3 4 5 6 7 8 9 10
0 NaN 4.0 4.0 4.0 4.0 NaN 3.0 3.0 3.0 NaN 1.0
1 NaN NaN 4.0 4.0 4.0 4.0 NaN NaN 1.0 NaN 1.0
2 NaN NaN 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0
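For what it's worth, the same idea can be sketched more compactly (though a row-wise apply is slower) by using isna().cumsum() as the run id within each row; run_sums here is a hypothetical helper, applied to the df built above:
def run_sums(row):
    na = row.isna()
    run_id = na.cumsum()  # a new run starts at every NaN
    # sum each run (sum skips the leading NaN), broadcast it back, re-blank the NaNs
    return row.groupby(run_id).transform("sum").where(~na)

out = df.apply(run_sums, axis=1)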

How to reset cumprod when na's are in for pandas column

I have 2 columns in a dataframe for which I want to calculate the cumprod, but the cumprod needs to restart once it sees an na in the cell
I have tried using cumprod straightforwardly, but it's not giving me the correct values because the cumprod is continuous and does not restart when the na shows up
Here is an example df
index col1 col2
0 2 4
1 6 4
2 1 na
3 2 7
4 na 6
5 na 8
6 5 na
7 8 9
8 3 2
here is my desired output:
index col1 col2
0 2 4
1 12 16
2 12 na
3 24 7
4 na 42
5 na 336
6 5 na
7 40 9
8 120 18
Here is a solution that operates on each column and concats back together, since the masks are different for each column.
pd.concat(
    [df[col].groupby(df[col].isnull().cumsum()).cumprod() for col in df.columns],
    axis=1)
col1 col2
0 2.0 4.0
1 12.0 16.0
2 12.0 NaN
3 24.0 7.0
4 NaN 42.0
5 NaN 336.0
6 5.0 NaN
7 40.0 9.0
8 120.0 18.0
A slightly more efficient approach is to calculate the grouper mask all at once and use zip:
m = df.isnull().cumsum()
pd.concat(
    [df[col].groupby(mask).cumprod() for col, mask in zip(df.columns, m.values.T)],
    axis=1)
Here's a similar solution with a dict comprehension and the default constructor:
pd.DataFrame({c: df[c].groupby(df[c].isna().cumsum()).cumprod() for c in df.columns})
col1 col2
0 2.0 4.0
1 12.0 16.0
2 12.0 NaN
3 24.0 7.0
4 NaN 42.0
5 NaN 336.0
6 5.0 NaN
7 40.0 9.0
8 120.0 18.0
You can use groupby with isna and cumsum to get the groups to cumprod over in each column, using apply:
df.apply(lambda x: x.groupby(x.isna().cumsum()).cumprod())
Output:
col1 col2
index
0 2.0 4.0
1 12.0 16.0
2 12.0 NaN
3 24.0 7.0
4 NaN 42.0
5 NaN 336.0
6 5.0 NaN
7 40.0 9.0
8 120.0 18.0
Here is a solution without operating column by column:
df = pd.DataFrame([[2, 4], [6, 4], [1, np.nan], [2, 7], [np.nan, 6],
                   [np.nan, 8], [5, np.nan], [8, 9], [3, 2]],
                  columns=['col1', 'col2'])
df_cumprod = df.cumprod()
# the running product accumulated just before each restart, held constant under and after the NaNs
adjust_factor = df_cumprod.ffill().where(df_cumprod.isnull()).ffill().fillna(1)
# dividing by it restarts the product after every NaN
print(df_cumprod / adjust_factor)
col1 col2
0 2.0 4.0
1 12.0 16.0
2 12.0 NaN
3 24.0 7.0
4 NaN 42.0
5 NaN 336.0
6 5.0 NaN
7 40.0 9.0
8 120.0 18.0
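The grouper trick is not specific to cumprod: grouping on isna().cumsum() restarts any cumulative operation at each NaN. For instance, a NaN-resetting cumsum is the same pattern (a sketch mirroring the apply answer above):
df.apply(lambda x: x.groupby(x.isna().cumsum()).cumsum())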

Pandas index interpolation filling in missing values after the last data point

Having a data frame with missing values at the end of a column, e.g.:
df = pd.DataFrame({'a':[np.nan,1,2,np.nan,np.nan,5,np.nan,np.nan]}, index=[0,1,2,3,4,5,6,7])
a
0 NaN
1 1.0
2 2.0
3 NaN
4 NaN
5 5.0
6 NaN
7 NaN
Using 'index' interpolation method:
df.interpolate(method='index')
Returns the data frame with the last missing values forward filled:
a
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 5.0
7 5.0
Is there a way to turn off that behaviour and leave the last missing values:
a
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 NaN
7 NaN
I think you need the new parameter limit_direction (pandas 0.23.0+); check this:
df = df.interpolate(method='index', limit=1, limit_direction='backward')
print (df)
a
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 NaN
7 NaN
EDIT: If you want to replace only the NaNs inside the valid values, add the parameter limit_area:
df = df.interpolate(method='index',limit_area='inside')
print (df)
a
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 NaN
7 NaN
Do you mean that the last NaNs (one or more) should remain?
How about this:
Find the index of the last valid value, then split, interpolate, and concatenate.
# positional index of the last non-NaN value (single-column frame, so the flat index works)
valargmax = np.max(np.where(df.notna().values.flatten())[0])
# interpolate up to and including it; leave the trailing NaNs untouched
r = pd.concat([df.iloc[:valargmax + 1].interpolate(method='index'),
               df.iloc[valargmax + 1:]])
print(r)
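A shorter way to locate the split point, assuming the single column 'a' from the question, is the built-in last_valid_index (a sketch):
last = df['a'].last_valid_index()  # label of the last non-NaN value
out = df.copy()
out.loc[:last] = df.loc[:last].interpolate(method='index')
print(out)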

Forward fill Pandas df only if an entire line is made of Nan

I would like to forward fill a pandas df with the previous line, but only when the current line is entirely composed of NaN.
This means that fillna(method='ffill', limit=1) does not work in my case, because it operates element-wise while I need a row-wise fillna.
Is there a more elegant way to achieve this task than the following instructions?
s = df.count(axis=1)
for d in df.index[1:]:
    if s.loc[d] == 0:
        i = s.index.get_loc(d)
        df.iloc[i] = df.iloc[i-1]
Input
v1 v2
1 1 2
2 nan 3
3 2 4
4 nan nan
Output
v1 v2
1 1 2
2 nan 3
3 2 4
4 2 4
You can use a boolean mask to filter which rows ffill is applied to:
m = df.isnull().all(axis=1) | df.notnull().all(axis=1)
print (m)
1 True
2 False
3 True
4 True
dtype: bool
print (df[m])
v1 v2
1 1.0 2.0
3 2.0 4.0
4 NaN NaN
df[m] = df[m].ffill()
print (df)
v1 v2
1 1.0 2.0
2 NaN 3.0
3 2.0 4.0
4 2.0 4.0
EDIT:
print (df)
v1 v2
1 1.0 2.0
2 NaN 7.0
3 4.0 8.0
4 NaN NaN
5 2.0 4.0
6 NaN 3.0
7 NaN NaN
m = df.isnull().all(axis=1) | df.notnull().all(axis=1)
print (m)
1 True
2 False
3 True
4 True
5 True
6 False
7 True
dtype: bool
long_str = 'some long helper str'
df[~m] = df[~m].fillna(long_str)           # temporarily protect NaNs in partially-valid rows
df = df.ffill().replace(long_str, np.nan)  # ffill everything, then restore the protected NaNs
print (df)
v1 v2
1 1.0 2.0
2 NaN 7.0
3 4.0 8.0
4 4.0 8.0
5 2.0 4.0
6 NaN 3.0
7 NaN 3.0
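One more sketch, under the assumption that all-NaN rows never occur back-to-back (with consecutive blank rows you need the ffill-based approach above):
all_nan = df.isna().all(axis=1)    # rows that are entirely NaN
df[all_nan] = df.shift()[all_nan]  # copy the previous row into them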
