Python Pandas - Evenly distribute numeric values to nearest rows

Suppose I have a data set like:
> NaN NaN NaN 12 NaN NaN NaN NaN 10 NaN NaN NaN NaN 8 NaN 6 NaN
I want to distribute each value as evenly as possible over itself and its surrounding NaNs. For example, the value 12 should be spread over its neighbouring NaNs, stopping where the NaNs start to belong to the next non-NaN value.
So the first 12 should only take its closest NaNs into consideration:
> NaN NaN NaN 12 NaN NaN
The output should be:
2 2 2 2 2 (Distributed by the 12)
2 2 2 2 2 (Distributed by the 10)
2 2 2 2 (Distributed by the 8)
2 2 2 (Distributed by the 6)
> NaN NaN NaN 12 NaN NaN NaN NaN 10 NaN NaN NaN NaN 8 NaN 6 NaN
> 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
I was originally thinking about using smoothers, such as the interpolate function in pandas. It does not have to be lossless, meaning that we can end up with a little less or more than the original sum in the process. Are there any libraries that can perform this kind of distribution, rather than using a lossy smoother?

You can use interpolate(method='nearest'), ffill() and bfill() and finally groupby().
Short version:
>>> series = pd.Series(x).interpolate(method='nearest').ffill().bfill()
>>> series.groupby(series).apply(lambda k: k/len(k))
[2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0]
To illustrate what's happening, create your df
df = pd.DataFrame()
df["x"] = x
where x is the series you gave. Now:
>>> df["inter"] = df.x.interpolate(method='nearest').ffill().bfill()
>>> df["inter"] = df.groupby("inter").inter.apply(lambda k: k/len(k))
>>> df
x inter
0 NaN 2.0
1 NaN 2.0
2 NaN 2.0
3 12.0 2.0
4 NaN 2.0
5 NaN 2.0
6 NaN 2.0
7 NaN 2.0
8 10.0 2.0
9 NaN 2.0
10 NaN 2.0
11 NaN 2.0
12 NaN 2.0
13 8.0 2.0
14 NaN 2.0
15 6.0 3.0
16 NaN 3.0
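For a self-contained run, here is a minimal sketch of the same pipeline, assuming x is the list from the question. It uses transform('size') instead of the groupby/apply above so the result keeps the original order; note that method='nearest' requires SciPy, and that grouping by value would merge two separate blocks that happen to share the same value.
import numpy as np
import pandas as pd

# The series from the question.
x = [np.nan, np.nan, np.nan, 12, np.nan, np.nan, np.nan, np.nan,
     10, np.nan, np.nan, np.nan, np.nan, 8, np.nan, 6, np.nan]

# Nearest-neighbour interpolation assigns every NaN the closest non-NaN value;
# ffill/bfill cover the leading and trailing NaNs that interpolate leaves alone.
series = pd.Series(x).interpolate(method='nearest').ffill().bfill()

# Divide each value by the size of its block so it is spread evenly over it.
result = series / series.groupby(series).transform('size')
print(result.tolist())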

Related

Is there a way to forward fill with ascending logic in pandas / numpy?

What is the most pandastic way to forward fill with ascending logic (without iterating over the rows)?
input:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['test'] = np.nan,np.nan,1,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,6,np.nan,np.nan
df['desired_output'] = np.nan,np.nan,1,1,1,3,3,3,3,3,6,6,6
print (df)
output:
test desired_output
0 NaN NaN
1 NaN NaN
2 1.0 1.0
3 NaN 1.0
4 NaN 1.0
5 3.0 3.0
6 NaN 3.0
7 NaN 3.0
8 2.0 3.0
9 NaN 3.0
10 6.0 6.0
11 NaN 6.0
12 NaN 6.0
In the 'test' column, the number of consecutive NaNs is random.
In the 'desired_output' column, I am trying to forward fill with ascending values only. Also, when a lower value is encountered (row 8, value = 2.0 above), it is overwritten with the current higher value.
Can anyone help? Thanks in advance.
You can combine cummax to select the cumulative maximum value and ffill to replace the NaNs:
df['desired_output'] = df['test'].cummax().ffill()
output:
test desired_output
0 NaN NaN
1 NaN NaN
2 1.0 1.0
3 NaN 1.0
4 NaN 1.0
5 3.0 3.0
6 NaN 3.0
7 NaN 3.0
8 2.0 3.0
9 NaN 3.0
10 6.0 6.0
11 NaN 6.0
12 NaN 6.0
intermediate Series:
df['test'].cummax()
0 NaN
1 NaN
2 1.0
3 NaN
4 NaN
5 3.0
6 NaN
7 NaN
8 3.0
9 NaN
10 6.0
11 NaN
12 NaN
Name: test, dtype: float64
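For completeness, a self-contained sketch of the cummax + ffill approach on the question's data (assumes only pandas and numpy):
import numpy as np
import pandas as pd

df = pd.DataFrame({'test': [np.nan, np.nan, 1, np.nan, np.nan, 3, np.nan,
                            np.nan, 2, np.nan, 6, np.nan, np.nan]})

# cummax keeps NaNs in place while tracking the running maximum of the valid
# values; ffill then propagates that running maximum into the NaN gaps.
df['desired_output'] = df['test'].cummax().ffill()
print(df)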

Fill nan gaps in pandas df only if gaps smaller than N nans

I am working with a pandas data frame that also contains NaN values. I want to substitute the NaNs with interpolated values using df.interpolate, but only if the length of the sequence of NaN values is <= N. As an example, let's assume that I choose N = 2 (so I want to fill sequences of NaNs only if they are up to 2 NaNs long) and I have a dataframe with
print(df)
A B C
1 1 1
nan nan 2
nan nan 3
nan 4 nan
5 5 5
In such a case I want to apply a function to df so that only the NaN sequences of length <= 2 get filled, while the longer sequences are left untouched, resulting in my desired output of
print(df)
A B C
1 1 1
nan 2 2
nan 3 3
nan 4 4
5 5 5
Note that I am aware of the limit=N option inside df.interpolate, but it doesn't do what I want, because it would fill a NaN sequence of any length, just limiting the filling to the first N NaNs, resulting in the undesired output
print(df)
A B C
1 1 1
2 2 2
3 3 3
nan 4 4
5 5 5
So, do you know of a function, or how to construct code, that produces my desired output? Thanks.
You can perform run length encoding and identify, for each column, the runs of NaN that are at most two elements long. One way to do that is to use get_id from the package pdrle (disclaimer: I wrote it).
import pdrle
chk = df.isna() & (df.apply(lambda x: x.groupby(pdrle.get_id(x)).transform(len)) <= 2)
df[chk] = df.interpolate()[chk]
# A B C
# 0 1.0 1.0 1.0
# 1 NaN 2.0 2.0
# 2 NaN 3.0 3.0
# 3 NaN 4.0 4.0
# 4 5.0 5.0 5.0
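If you would rather avoid the extra dependency, the run ids can be built with plain pandas as well. A sketch of the same idea (the helper name short_gap is made up here):
def short_gap(s, n=2):
    # True where s is NaN and the NaN run containing it is at most n long.
    is_na = s.isna()
    run_id = is_na.ne(is_na.shift()).cumsum()   # label consecutive runs of NaN / non-NaN
    return is_na & (is_na.groupby(run_id).transform('size') <= n)

chk = df.apply(short_gap)
df[chk] = df.interpolate()[chk]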
Try:
N = 2
df_interpolated = df.interpolate()
for c in df:
    mask = df[c].isna()
    x = (
        mask.groupby((mask != mask.shift()).cumsum()).transform(
            lambda x: len(x) > N
        )
        * mask
    )
    df_interpolated[c] = df_interpolated.loc[~x, c]
print(df_interpolated)
Prints:
A B C
0 1.0 1.0 1.0
1 NaN 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 5.0 5.0 5.0
Trying with a different df:
A B C
0 1.0 1.0 1.0
1 NaN NaN 2.0
2 NaN NaN 3.0
3 NaN 4.0 NaN
4 5.0 5.0 5.0
5 NaN 5.0 NaN
6 NaN 5.0 NaN
7 8.0 5.0 NaN
produces:
A B C
0 1.0 1.0 1.0
1 NaN 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 5.0 5.0 5.0
5 6.0 5.0 NaN
6 7.0 5.0 NaN
7 8.0 5.0 NaN
You can try the following -
n=2
cols = df.columns[df.isna().sum()<=n]
df[cols] = df[cols].interpolate()
df
A B C
0 1.0 1.0 1.0
1 NaN 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 5.0 5.0 5.0
df.columns[df.isna().sum()<=n] filters the columns based on your condition. Then, you simply overwrite the columns after interpolation.

Fill NaN with the closest not NaN columns values in Python

I want to fill column e's NaNs with the value of the closest non-NaN column to its left (by position).
a b c d e
0 1 2.0 3.0 6.0 3.0
1 3 5.0 7.0 NaN NaN
2 2 4.0 NaN NaN NaN
3 5 6.0 NaN NaN NaN
4 3 NaN NaN NaN NaN
For example, for the second row of e, the closest non-NaN column by position is c, so we take 7.0. Is it possible to do this in pandas? Thanks.
The expected output is like this:
a b c d e
0 1 2.0 3.0 6.0 3.0
1 3 5.0 7.0 NaN 7.0
2 2 4.0 NaN NaN 4.0
3 5 6.0 NaN NaN 6.0
4 3 NaN NaN NaN 3.0
To simplify: forward fill along the columns so the last non-missing value from the left is carried through to the last column, then select the last column by position:
df.e = df.ffill(axis=1).iloc[:, -1]
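A quick end-to-end check of that one-liner, with the frame rebuilt from the question's values (a sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 2, 5, 3],
                   'b': [2.0, 5.0, 4.0, 6.0, np.nan],
                   'c': [3.0, 7.0, np.nan, np.nan, np.nan],
                   'd': [6.0, np.nan, np.nan, np.nan, np.nan],
                   'e': [3.0, np.nan, np.nan, np.nan, np.nan]})

# Forward fill across the columns so every cell carries the closest non-NaN
# value to its left; the last column then holds each row's rightmost value.
df['e'] = df.ffill(axis=1).iloc[:, -1]
print(df['e'].tolist())   # [3.0, 7.0, 4.0, 6.0, 3.0]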

I want to subtract each column from the previous non-null column using the diff function

I have a long list of columns and I want to subtract the previous column from the current column and replace the current column with the difference.
So if I have:
A B C D
1 NaN 3 7
3 NaN 8 10
2 NaN 6 11
I want the output to be:
A B C D
1 NaN 2 4
3 NaN 5 2
2 NaN 4 5
I have been trying to use this code:
df2 = df1.diff(axis=1)
but this does not produce the desired output
Thanks in advance.
You can do this with df.where and then update to bring back the first non-null entry for each row of your DataFrame.
Sample Data: df
A B C D
0 1.0 NaN 3.0 7.0
1 1.0 4.0 5.0 9.0
2 NaN 4.0 NaN 4.0
3 NaN 4.0 NaN NaN
4 NaN NaN 3.0 7.0
5 3.0 NaN NaN 7.0
6 6.0 NaN NaN NaN
Code:
df_d = df.where(df.isnull(),
                df.fillna(method='ffill', axis=1).diff(axis=1))
df_d.update(df.where(df.notnull().cumsum(1).cumsum(1) == 1))
Output: df_d
A B C D
0 1.0 NaN 2.0 4.0
1 1.0 3.0 1.0 4.0
2 NaN 4.0 NaN 0.0
3 NaN 4.0 NaN NaN
4 NaN NaN 3.0 4.0
5 3.0 NaN NaN 4.0
6 6.0 NaN NaN NaN
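The same logic written out step by step may be easier to follow (a sketch of the answer above; ffill(axis=1) is just the method form of fillna(method='ffill', axis=1)):
filled = df.ffill(axis=1)                 # carry the previous non-null value to the right
diffed = filled.diff(axis=1)              # current column minus the (filled) previous column
df_d = df.where(df.isnull(), diffed)      # keep NaN wherever the original was NaN
first = df.notnull().cumsum(axis=1).cumsum(axis=1) == 1   # first non-null cell of each row
df_d.update(df.where(first))              # restore that first value unchanged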
Actually, it is producing the desired result; you are calculating diff across NaN values, which yields NaN, so diff is working as expected.
For your case, just copy the first column back from the original dataframe and you should be fine:
df2=df1.diff(axis=1)
df2.A=df1.A
print(df2)
Output
A B C D
1 NaN 2.0 4.0

Pandas index interpolation filling in missing values after the last data point

Having a data frame with missing values at the end of a column, e.g.:
df = pd.DataFrame({'a':[np.nan,1,2,np.nan,np.nan,5,np.nan,np.nan]}, index=[0,1,2,3,4,5,6,7])
a
0 NaN
1 1.0
2 2.0
3 NaN
4 NaN
5 5.0
6 NaN
7 NaN
Using the 'index' interpolation method:
df.interpolate(method='index')
returns the data frame with the last missing values forward filled:
a
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 5.0
7 5.0
Is there a way to turn off that behaviour and leave the last missing values:
a
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 NaN
7 NaN
I think you need the limit_direction parameter here; check this:
df = df.interpolate(method='index', limit=1, limit_direction='backward')
print (df)
a
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 NaN
7 NaN
EDIT: If you want to replace only the NaNs between valid values, add the limit_area parameter (new in pandas 0.23.0):
df = df.interpolate(method='index',limit_area='inside')
print (df)
a
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 NaN
7 NaN
Do you mean that the last NaNs (one or more) should remain?
How about this:
Find the last valid index, split the frame there, interpolate the first part, and append the rest.
valargmax=np.max(np.where((df.isnull().eq(False).values==True).flatten()==True))
r = df[0:(valargmax+1)].interpolate(method='index').append(df[(valargmax+1):])
print(r)
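A note for newer pandas: DataFrame.append was removed in 2.0, so the same split, interpolate and rejoin can be written with pd.concat. A sketch under that assumption:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1, 2, np.nan, np.nan, 5, np.nan, np.nan]})

# Position of the last row that still holds a valid value.
last_valid = np.max(np.where(df.notna().values.any(axis=1)))

# Interpolate only up to that row, then glue the trailing NaNs back on.
r = pd.concat([df.iloc[:last_valid + 1].interpolate(method='index'),
               df.iloc[last_valid + 1:]])
print(r)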
