Pandas dynamically replace nan values - python

I have a DataFrame that looks like this:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 2, np.nan, 1, np.nan, np.nan, 4, 2, 3, np.nan],
                   'b': [4, 2, 3, np.nan, np.nan, 1, 5, np.nan, 5, 8]})
     a    b
0  1.0  4.0
1  2.0  2.0
2  NaN  3.0
3  1.0  NaN
4  NaN  NaN
5  NaN  1.0
6  4.0  5.0
7  2.0  NaN
8  3.0  5.0
9  NaN  8.0
I want to dynamically replace the nan values. I have tried (df.ffill() + df.bfill()) / 2, but that does not yield the desired output: it assigns the same fill value to an entire run of NaNs at once, rather than filling dynamically. I have also tried interpolate, but it doesn't work well for non-linear data.
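For reference, (df.ffill() + df.bfill()) / 2 produces the frame below: rows 4 and 5 of a both get the same static 2.5 instead of 2.5 and 3.25, and the trailing nan of a is not filled at all.
(df.ffill() + df.bfill()) / 2
     a    b
0  1.0  4.0
1  2.0  2.0
2  1.5  3.0
3  1.0  2.0
4  2.5  2.0
5  2.5  1.0
6  4.0  5.0
7  2.0  5.0
8  3.0  5.0
9  NaN  8.0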
I have seen this answer but did not fully understand it, and am not sure whether it would work.
Update on the computation of the values
I want every nan value to be the mean of the previous and next non-nan values. If there is more than one nan in sequence, I want to replace them one at a time, recomputing the mean as I go. E.g., given 1, np.nan, np.nan, 4: the first nan becomes the mean of 1 and 4 (2.5), giving 1, 2.5, np.nan, 4; the second nan then becomes the mean of 2.5 and 4, giving 1, 2.5, 3.25, 4.
The desired output is
      a    b
0  1.00  4.0
1  2.00  2.0
2  1.50  3.0
3  1.00  2.0
4  2.50  1.5
5  3.25  1.0
6  4.00  5.0
7  2.00  5.0
8  3.00  5.0
9  1.50  8.0

Inspired by the #ye olde noobe answer (thanks to him!):
I've optimized it to make it ≃ 100x faster (timing comparison below):
def custom_fillna(s: pd.Series):
    for i in range(len(s)):
        if pd.isna(s[i]):
            # mean of the last valid value before i and the first valid value
            # from i onwards (0 if there is none on that side); because the
            # fill happens in place, later NaNs see the values filled earlier
            last_valid_number = s[s[:i].last_valid_index()] if s[:i].last_valid_index() is not None else 0
            next_valid_number = s[s[i:].first_valid_index()] if s[i:].first_valid_index() is not None else 0
            s[i] = (last_valid_number + next_valid_number) / 2

custom_fillna(df['a'])
df
Times comparison (the original chart is not reproduced here):
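A rough way to re-run the comparison yourself (a sketch; it assumes custom_fillna above and fill_dynamically from the answer below are both defined, and exact numbers will vary by machine and pandas version):
import timeit

s = df['a']
# each run starts from a fresh copy so the NaNs are still present
print(timeit.timeit(lambda: custom_fillna(s.copy()), number=100))
print(timeit.timeit(lambda: fill_dynamically(s.copy()), number=100))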

Maybe not the most optimized, but it works (note: from your example, I assume that if there is no valid value before or after a NaN, like the last row on column a, 0 is used as a replacement):
import pandas as pd

def fill_dynamically(s: pd.Series):
    for i in range(len(s)):
        # mean of the first valid value at/after i and the last valid value
        # at/before i (0 if there is none on that side); non-NaN entries are
        # rewritten to (x + x) / 2, i.e. left unchanged
        s[i] = (
            (0 if s[i:].first_valid_index() is None else s[i:][s[i:].first_valid_index()]) +
            (0 if s[:i+1].last_valid_index() is None else s[:i+1][s[:i+1].last_valid_index()])
        ) / 2
Use like this for the full dataframe:
df = pd.DataFrame({'a': [1, 2, np.nan, 1, np.nan, np.nan, 4, 2, 3, np.nan],
                   'b': [4, 2, 3, np.nan, np.nan, 1, 5, np.nan, 5, 8]})
df.apply(fill_dynamically)
df after applying:
      a    b
0  1.00  4.0
1  2.00  2.0
2  1.50  3.0
3  1.00  2.0
4  2.50  1.5
5  3.25  1.0
6  4.00  5.0
7  2.00  5.0
8  3.00  5.0
9  1.50  8.0
In case you have other columns and don't want to apply this to the whole dataframe, you can of course use it on a single column, like this:
df = pd.DataFrame({'a': [1, 2, np.nan, 1, np.nan, np.nan, 4, 2, 3, np.nan],
                   'b': [4, 2, 3, np.nan, np.nan, 1, 5, np.nan, 5, 8]})
fill_dynamically(df['a'])
In this case, df looks like this:
      a    b
0  1.00  4.0
1  2.00  2.0
2  1.50  3.0
3  1.00  NaN
4  2.50  NaN
5  3.25  1.0
6  4.00  5.0
7  2.00  NaN
8  3.00  5.0
9  1.50  8.0

Related

Fill nan gaps in pandas df only if gaps smaller than N nans

I am working with a pandas data frame that also contains nan values. I want to substitute the nans with interpolated values using df.interpolate, but only if the length of the sequence of nan values is <= N. As an example, let's assume that I choose N = 2 (so I want to fill sequences of up to 2 nans) and I have a dataframe with
print(df)
  A    B    C
  1    1    1
nan  nan    2
nan  nan    3
nan    4  nan
  5    5    5
In such a case I want to apply a function on df such that only the nan sequences of length <= 2 get filled, while the longer sequences stay untouched, resulting in my desired output of
print(df)
  A  B  C
  1  1  1
nan  2  2
nan  3  3
nan  4  4
  5  5  5
Note that I am aware of the limit=N option in df.interpolate, but it doesn't do what I want: it fills sequences of any length, and just limits the filling to the first N nans of each gap, resulting in the undesired output
print(df)
  A  B  C
  1  1  1
  2  2  2
  3  3  3
nan  4  4
  5  5  5
So do you know of a function, or how to construct code, that produces my desired output? Thanks!
You can perform run-length encoding and identify the runs of NaN that are two elements or shorter in each column. One way to do that is to use get_id from the package pdrle (disclaimer: I wrote it).
import pdrle

# mark the NaN cells that belong to a run of at most two NaNs
chk = df.isna() & (df.apply(lambda x: x.groupby(pdrle.get_id(x)).transform(len)) <= 2)
df[chk] = df.interpolate()[chk]
#      A    B    C
# 0  1.0  1.0  1.0
# 1  NaN  2.0  2.0
# 2  NaN  3.0  3.0
# 3  NaN  4.0  4.0
# 4  5.0  5.0  5.0
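If you prefer to avoid the extra dependency, the same run ids can be built with the classic shift/cumsum trick on the NaN mask; a sketch that should be equivalent for this use:
mask = df.isna()
# a new run starts wherever the NaN mask flips, so cumsum of the flips is a run id;
# transform("size") then gives the length of the run each cell belongs to
run_len = mask.apply(lambda m: m.groupby((m != m.shift()).cumsum()).transform("size"))
chk = mask & (run_len <= 2)
df[chk] = df.interpolate()[chk]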
Try:
N = 2
df_interpolated = df.interpolate()
for c in df:
    # mark positions that belong to NaN runs longer than N ...
    mask = df[c].isna()
    x = (
        mask.groupby((mask != mask.shift()).cumsum()).transform(
            lambda x: len(x) > N
        )
        * mask
    )
    # ... and drop the interpolated values at those positions
    df_interpolated[c] = df_interpolated.loc[~x, c]
print(df_interpolated)
Prints:
     A    B    C
0  1.0  1.0  1.0
1  NaN  2.0  2.0
2  NaN  3.0  3.0
3  NaN  4.0  4.0
4  5.0  5.0  5.0
Trying with a different df:
     A    B    C
0  1.0  1.0  1.0
1  NaN  NaN  2.0
2  NaN  NaN  3.0
3  NaN  4.0  NaN
4  5.0  5.0  5.0
5  NaN  5.0  NaN
6  NaN  5.0  NaN
7  8.0  5.0  NaN
produces:
     A    B    C
0  1.0  1.0  1.0
1  NaN  2.0  2.0
2  NaN  3.0  3.0
3  NaN  4.0  4.0
4  5.0  5.0  5.0
5  6.0  5.0  NaN
6  7.0  5.0  NaN
7  8.0  5.0  NaN
You can try the following -
n = 2
cols = df.columns[df.isna().sum() <= n]
df[cols] = df[cols].interpolate()
df
     A    B    C
0  1.0  1.0  1.0
1  NaN  2.0  2.0
2  NaN  3.0  3.0
3  NaN  4.0  4.0
4  5.0  5.0  5.0
df.columns[df.isna().sum() <= n] selects the columns whose total NaN count is at most n; then you simply overwrite those columns after interpolation. Note that this filters on the total number of NaNs per column rather than on run lengths, so it only matches the requirement when each column contains a single gap.

Using fillna to replace missing data

When I try to use fillna to replace the NaNs in a column with its mean, the column changes from float64 to object, showing:
bound method Series.mean of 0 NaN\n1
Here is the code:
mean = df['texture_mean'].mean
df['texture_mean'] = df['texture_mean'].fillna(mean)
You cannot use mean = df['texture_mean'].mean — without the parentheses you assign the bound method itself instead of calling it, and that method object is what ends up in the column. This is where the problem lies. The following code will work -
df = pd.DataFrame({'texture_mean': [2, 4, None, 6, 1, None], 'A': [1, 2, 3, 4, 5, None]})  # example
df
     A  texture_mean
0  1.0           2.0
1  2.0           4.0
2  3.0           NaN
3  4.0           6.0
4  5.0           1.0
5  NaN           NaN
df['texture_mean'] = df['texture_mean'].fillna(df['texture_mean'].mean())
df
     A  texture_mean
0  1.0          2.00
1  2.0          4.00
2  3.0          3.25
3  4.0          6.00
4  5.0          1.00
5  NaN          3.25
In case you want to replace the NaNs in all columns with the respective column means, just do this -
df = df.fillna(df.mean())
df
     A  texture_mean
0  1.0          2.00
1  2.0          4.00
2  3.0          3.25
3  4.0          6.00
4  5.0          1.00
5  3.0          3.25
Let me know if this is what you want.
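To see why the original version failed, here is a minimal illustration of method versus method call (values from the example frame above):
m = df['texture_mean'].mean     # a bound method object, not a number
print(type(m))                  # <class 'method'>
m = df['texture_mean'].mean()   # calling it returns the number you want
print(m)                        # 3.25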

Create a column that has the same length as the longest column in the data, at the same time

I have the following data:
import pandas

data = [[1, 2, 3], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6, 7]]
dataFrame = pandas.DataFrame(data).transpose()
Output:
     0    1    2
0  1.0  1.0  1.0
1  2.0  2.0  2.0
2  3.0  3.0  3.0
3  NaN  4.0  4.0
4  NaN  5.0  5.0
5  NaN  NaN  6.0
6  NaN  NaN  7.0
Is it possible to create a 4th column AT THE SAME TIME the other columns are created from data, one which has the same length as the longest column of this dataframe (the 3rd one)?
The data in this column doesn't matter. Assume it's 8. So the desired output can be:
     0    1    2    3
0  1.0  1.0  1.0  8.0
1  2.0  2.0  2.0  8.0
2  3.0  3.0  3.0  8.0
3  NaN  4.0  4.0  8.0
4  NaN  5.0  5.0  8.0
5  NaN  NaN  6.0  8.0
6  NaN  NaN  7.0  8.0
In my script the dataframe keeps changing every time, which means the longest column keeps changing with it.
Thanks for reading
This is quite similar to answers from #jpp, #Cleb, and maybe some other answers here, just slightly simpler:
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]] + [[]]
This will automatically give you a column of NaNs that is the same length as the longest column, so you don't need the extra work of calculating the length of the longest column. Resulting dataframe:
     0    1    2    3
0  1.0  1.0  1.0  NaN
1  2.0  2.0  2.0  NaN
2  3.0  3.0  3.0  NaN
3  NaN  4.0  4.0  NaN
4  NaN  5.0  5.0  NaN
5  NaN  NaN  6.0  NaN
6  NaN  NaN  7.0  NaN
Note that this answer is less general than some others here (such as those by #jpp and #Cleb) in that it will only fill with NaNs. If you want a default fill value other than NaN, you should use one of their answers.
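That said, if you decide you want a constant after all, one simple option is to fill the placeholder column afterwards:
df = pd.DataFrame(data).transpose()  # data already includes the trailing [[]]
df[3] = df[3].fillna(8)              # replace the placeholder NaNs with a default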
You can append to a list which then immediately feeds the pd.DataFrame constructor:
import pandas as pd
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]]
df = pd.DataFrame(data + [[8]*max(map(len, data))]).transpose()
print(df)
     0    1    2    3
0  1.0  1.0  1.0  8.0
1  2.0  2.0  2.0  8.0
2  3.0  3.0  3.0  8.0
3  NaN  4.0  4.0  8.0
4  NaN  5.0  5.0  8.0
5  NaN  NaN  6.0  8.0
6  NaN  NaN  7.0  8.0
But this is inefficient. Pandas uses NumPy to hold the underlying series, and setting a series to a constant value is trivial and efficient; you can simply use:
df[3] = 8
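That is, build the frame from the original three lists and broadcast the scalar afterwards:
df = pd.DataFrame(data).transpose()
df[3] = 8  # the scalar is broadcast down the whole new column
print(df)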
It is not entirely clear what you mean by at the same time, but the following would work:
import pandas as pd
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]]
# append a dummy row as long as the longest list in data
data.append([8] * max(map(len, data)))
pd.DataFrame(data).transpose()
yielding
     0    1    2    3
0  1.0  1.0  1.0  8.0
1  2.0  2.0  2.0  8.0
2  3.0  3.0  3.0  8.0
3  NaN  4.0  4.0  8.0
4  NaN  5.0  5.0  8.0
5  NaN  NaN  6.0  8.0
6  NaN  NaN  7.0  8.0
If you'd like to do it as you create the DataFrame, simply chain a call to assign (note that the new column label is the string '3', since keyword-argument names must be strings):
pd.DataFrame(data).T.assign(**{'3': 8})
     0    1    2  3
0  1.0  1.0  1.0  8
1  2.0  2.0  2.0  8
2  3.0  3.0  3.0  8
3  NaN  4.0  4.0  8
4  NaN  5.0  5.0  8
5  NaN  NaN  6.0  8
6  NaN  NaN  7.0  8
You can use a function (read the comments):
def f(df):
    # a list of 8s as long as the column with the most non-NaN values
    l = [8] * df[max(df, key=lambda x: df[x].count())].count()
    # pad with NaNs up to the length of the frame
    df[3] = l + [np.nan] * (len(df) - len(l))
    # the above two lines can be just `df[3] = another solution currently for this problem`
    return df

dataFrame = f(pandas.DataFrame(data).transpose())
Then:
print(dataFrame)
Returns:
     0    1    2  3
0  1.0  1.0  1.0  8
1  2.0  2.0  2.0  8
2  3.0  3.0  3.0  8
3  NaN  4.0  4.0  8
4  NaN  5.0  5.0  8
5  NaN  NaN  6.0  8
6  NaN  NaN  7.0  8
If you mean at the same time as running pd.DataFrame, the data has to be prepped before it is loaded into your frame.
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]]
longest = max(len(i) for i in data)
dummy = [8 for i in range(longest)] #dummy data filled with 8
data.append(dummy)
dataFrame = pd.DataFrame(data).transpose()
The example above finds the length of the longest list and creates a dummy row of that length, which is appended to the data before creating your dataframe.
One solution is to add an element to the list that is passed to the dataframe:
pd.DataFrame(data + [[np.hstack(data).max() + 1] * max(map(len, data))]).T
     0    1    2    3
0  1.0  1.0  1.0  8.0
1  2.0  2.0  2.0  8.0
2  3.0  3.0  3.0  8.0
3  NaN  4.0  4.0  8.0
4  NaN  5.0  5.0  8.0
5  NaN  NaN  6.0  8.0
6  NaN  NaN  7.0  8.0
If data is to be modified, just:
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]]
data = data + [[np.hstack(data).max() + 1] * max(map(len, data))]
pd.DataFrame(data).T

pandas dataframe row proportions

I have a dataframe with multiple columns and rows
For all columns, I need the row value to equal 0.5 of that row plus 0.5 of the previous row's value.
I currently have a loop set up which works, but I feel there is a better way without using a loop. Does anyone have any thoughts?
dataframe = df_input
df_output = df_input.copy()
for i in range(1, df_input.shape[0]):
    try:
        df_output.iloc[[i]] = (df_input.iloc[[i - 1]] * (1 / 2)).values + (df_input.iloc[[i]] * (1 / 2)).values
    except:
        pass
Do you mean something like this:
First creating test data:
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 20, [5, 3]), columns=['A', 'B', 'C'])
    A   B   C
0   6  19  14
1  10   7   6
2  18  10  10
3   3   7   2
4   1  11   5
Your requested function:
(df*.5).rolling(2).sum()
      A     B     C
0   NaN   NaN   NaN
1   8.0  13.0  10.0
2  14.0   8.5   8.0
3  10.5   8.5   6.0
4   2.0   9.0   3.5
EDIT:
for an unbalanced sum you can define an auxiliary function:
def weighted_mean(arr):
    return sum(arr * [.25, .75])

df.rolling(2).apply(weighted_mean, raw=True)
       A      B     C
0    NaN    NaN   NaN
1   9.00  10.00  8.00
2  16.00   9.25  9.00
3   6.75   7.75  4.00
4   1.50  10.00  4.25
EDIT2:
...and if the weights should be settable at runtime:
def weighted_mean(arr, weights=[.5, .5]):
    # normalizing by the weight total allows arbitrary ratios, not just fractions of one
    return sum(arr * weights / sum(weights))
With no additional argument it defaults to the balanced mean:
df.rolling(2).apply(weighted_mean, raw=True)
      A     B     C
0   NaN   NaN   NaN
1   8.0  13.0  10.0
2  14.0   8.5   8.0
3  10.5   8.5   6.0
4   2.0   9.0   3.5
An unbalanced mean:
df.rolling(2).apply(weighted_mean, raw=True, args=[[.25, .75]])
       A      B     C
0    NaN    NaN   NaN
1   9.00  10.00  8.00
2  16.00   9.25  9.00
3   6.75   7.75  4.00
4   1.50  10.00  4.25
The division by sum(weights) means the weights are not restricted to fractions of one but can be given as any ratio:
df.rolling(2).apply(weighted_mean, raw=True, args=[[1, 3]])
       A      B     C
0    NaN    NaN   NaN
1   9.00  10.00  8.00
2  16.00   9.25  9.00
3   6.75   7.75  4.00
4   1.50  10.00  4.25
df.rolling(window=2, min_periods=1).apply(lambda x: x[0]*0.5 + x[1]*0.5 if len(x) > 1 else x[0], raw=True)
This will do the same operation for all columns.
Explanation: for each rolling window the lambda receives the raw values of one column, structured like [this_col[i-1], this_col[i]], and doing custom arithmetic on that pair is straightforward.
Something like below?
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 1)), columns=['a'])
df["cumsum_a"] = 0.5 * df["a"].cumsum() + 0.5 * df["a"]

Pandas index interpolation filling in missing values after the last data point

Having a data frame with missing values at the end of a column, e.g.:
df = pd.DataFrame({'a':[np.nan,1,2,np.nan,np.nan,5,np.nan,np.nan]}, index=[0,1,2,3,4,5,6,7])
     a
0  NaN
1  1.0
2  2.0
3  NaN
4  NaN
5  5.0
6  NaN
7  NaN
Using 'index' interpolation method:
df.interpolate(method='index')
Returns the data frame with the last missing values forward filled:
     a
0  NaN
1  1.0
2  2.0
3  3.0
4  4.0
5  5.0
6  5.0
7  5.0
Is there a way to turn off that behaviour and leave the last missing values as they are:
     a
0  NaN
1  1.0
2  2.0
3  3.0
4  4.0
5  5.0
6  NaN
7  NaN
I think you need the new parameter limit_direction (pandas 0.23.0+); check this:
df = df.interpolate(method='index', limit=1, limit_direction='backward')
print(df)
     a
1  1.0
2  2.0
3  3.0
4  4.0
5  5.0
6  NaN
7  NaN
EDIT: If you want to replace only the NaNs inside the valid values, add the parameter limit_area:
df = df.interpolate(method='index', limit_area='inside')
print(df)
     a
0  NaN
1  1.0
2  2.0
3  3.0
4  4.0
5  5.0
6  NaN
7  NaN
Do you mean that the last NaNs (one or more) should remain?
How about this: find the last valid positional index, split there, interpolate the first part, and append the rest.
# positional index of the last non-NaN value in the frame
valargmax = np.max(np.where(df.notna().values.flatten()))
r = df[0:(valargmax + 1)].interpolate(method='index').append(df[(valargmax + 1):])
print(r)
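Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; the same split-and-interpolate idea with pd.concat looks like this (a sketch):
# positional index of the last non-NaN value (single-column frame assumed)
last_valid = np.max(np.where(df.notna().values.flatten()))
r = pd.concat([df.iloc[:last_valid + 1].interpolate(method='index'),
               df.iloc[last_valid + 1:]])
print(r)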