Forward fill only certain value - python

I have an array that represents object states, where 0 means the object is off and 1 means it is on.
import pandas as pd
import numpy as np
s = [np.nan, 0, np.nan, np.nan, 1, np.nan, np.nan, 0, np.nan, 1, np.nan]
df = pd.DataFrame(s, columns=["s"])
df
s
0 NaN
1 0.0
2 NaN
3 NaN
4 1.0
5 NaN
6 NaN
7 0.0
8 NaN
9 1.0
10 NaN
I need to forward fill only the 0-values in it, like below.
>>> df_wanted
s
0 NaN
1 0.0
2 0.0
3 0.0
4 1.0
5 NaN
6 NaN
7 0.0
8 0.0
9 1.0
10 NaN
After browsing similar questions here, I just compare the ffill-ed and bfill-ed values and assign back with a mask:
mask = (df.ffill() == 0) & (df.bfill() == 1)
df[mask] = 0
df
s
0 NaN
1 0.0
2 0.0
3 0.0
4 1.0
5 NaN
6 NaN
7 0.0
8 0.0
9 1.0
10 NaN
But it won't help if a 0 value is not followed by a 1. What would be a more elegant solution that takes such cases into account?

mask = (df.ffill() == 0) alone should suffice for your use case.
df.ffill propagates the last valid observation forward, so NaN rows after a 0 are filled with 0s and NaN rows after a 1 are filled with 1s. Compare the result to 0 to select only the rows that resolve to 0, and use that as a mask to get your final df.
Example (I added a 0 and a few NaNs to the end of your df):
>>> s = [np.nan, 0, np.nan, np.nan, 1, np.nan, np.nan, 0, np.nan, 1, np.nan, np.nan, 0, np.nan, np.nan, np.nan]
>>> df = pd.DataFrame(s, columns=["s"])
>>> df
s
0 NaN
1 0.0
2 NaN
3 NaN
4 1.0
5 NaN
6 NaN
7 0.0
8 NaN
9 1.0
10 NaN
11 NaN
12 0.0
13 NaN
14 NaN
15 NaN
>>> df[df.ffill() == 0] = 0
>>> df
s
0 NaN
1 0.0
2 0.0
3 0.0
4 1.0
5 NaN
6 NaN
7 0.0
8 0.0
9 1.0
10 NaN
11 NaN
12 0.0
13 0.0
14 0.0
15 0.0

One way, maybe not the most elegant but one that works, would be to ffill everything and then pick values from it wherever your original series was NaN and the ffilled series is 0.
sf = df.ffill().values[:, 0]  # forward-filled values as a 1-D array
desired = np.where(np.isnan(s) & (sf == 0), sf, s)  # use the fill only where the original was NaN and the fill is 0
pandas has a where function too; I'm just more comfortable with numpy since it's more versatile.
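For completeness, here is a minimal pandas-only sketch of the same idea (my own addition, using Series.mask, the complement of Series.where):

filled = df["s"].ffill()
df["s"] = df["s"].mask(df["s"].isna() & filled.eq(0), 0)  # fill only the gaps that follow a 0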

Related

find first non NaN value in shift pandas

I have the following issue. I would like to compute a lag of a column in my df. However, I have a condition: the lagged value cannot be NaN.
See the example below:
import pandas as pd
import numpy as np
d = {'col1': [1, 2, 10, 5, 3, 2], 'col2': [3, 4, np.nan, np.nan, 23, 42]}
df = pd.DataFrame(data=d)
when I try this:
df["col2_lag"] = df["col2"].shift(1)
I got this result:
col1 col2 col2_lag
0 1 3.0 NaN
1 2 4.0 3.0
2 10 NaN 4.0
3 5 NaN NaN
4 3 23.0 NaN
5 2 42.0 23.0
However, desired output is this:
col1 col2 col2_lag
0 1 3.0 NaN
1 2 4.0 3.0
2 10 NaN 4.0
3 5 NaN 4.0 #because we skip NaN and find first non NaN
4 3 23.0 4.0 #because we skip NaN and find first non NaN
5 2 42.0 23.0
Is there an elegant way to do this? Ideally without writing my own function. Thanks.
Use ffill:
df["col2_lag"] = df["col2"].shift(1).ffill()

Pandas - Fill NaN using multiple values

I have a column (let's call it Column X) containing around 16000 NaN values. The column has two possible values, 1 or 0 (so it's binary).
I want to fill the NaN values in Column X, but I don't want to use a single value for ALL the NaN entries.
Say, for instance, that I want to fill 50% of the NaN values with '1' and the other 50% with '0'.
I have read the fillna() documentation but have not found anything that satisfies this functionality.
I have literally no idea how to move forward with this problem, so I haven't tried anything yet.
df['Column_x'] = df['Column_x'].fillna(df['Column_x'].mode()[0])
but this would fill ALL the NaN values in Column X of my dataframe 'df' with the mode of the column; I want to fill 50% with one value and the other 50% with a different value.
Since I haven't tried anything yet, I can't show or describe any actual results.
What I can tell is that the expected result would be something along the lines of 8000 NaN values of Column X replaced with '1' and another 8000 with '0'.
A visual result would be something like;
Before Handling NaN
Index Column_x
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 1.0
7 1.0
8 1.0
9 1.0
10 1.0
11 1.0
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
After Handling NaN
Index Column_x
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 1.0
7 1.0
8 1.0
9 1.0
10 1.0
11 1.0
12 0.0
13 0.0
14 0.0
15 0.0
16 1.0
17 1.0
18 1.0
19 1.0
You can use random.choices with its weights parameter to keep the distribution the same. I've simulated an all-NaN column with numpy here and take the exact number of replacements needed. This approach also works for columns with more than two classes and for more complex distributions.
import pandas as pd
import numpy as np
import random
df = pd.DataFrame({'col1': range(16000)})
df['col2'] = np.nan
nans = df['col2'].isna()
length = sum(nans)
replacement = random.choices([0, 1], weights=[.5, .5], k=length)
df.loc[nans,'col2'] = replacement
print(df.describe())
'''
Out:
col1 col2
count 16000.000000 16000.000000
mean 7999.500000 0.507625
std 4618.946489 0.499957
min 0.000000 0.000000
25% 3999.750000 0.000000
50% 7999.500000 1.000000
75% 11999.250000 1.000000
max 15999.000000 1.000000
'''
Using pandas.Series.sample:
mask = df['Column_x'].isna()
ind = df['Column_x'].loc[mask].sample(frac=0.5).index
df.loc[ind, 'Column_x'] = 1
df['Column_x'] = df['Column_x'].fillna(0)
print(df)
Output:
Index Column_x
0 0 0.0
1 1 0.0
2 2 0.0
3 3 0.0
4 4 0.0
5 5 0.0
6 6 1.0
7 7 1.0
8 8 1.0
9 9 1.0
10 10 1.0
11 11 1.0
12 12 1.0
13 13 0.0
14 14 1.0
15 15 0.0
16 16 0.0
17 17 1.0
18 18 1.0
19 19 0.0
Use column slicing and fill values.
isnull() detects missing values in the given Series object.
Example:
import pandas as pd
df = pd.DataFrame({'Column_y': pd.Series(range(9), index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
                   'Column_x': pd.Series(range(1), index=['a'])})
print(df)
# get list of index series which have NaN Column_x value
idx = df['Column_x'].index[df['Column_x'].isnull()]
total_nan_len = len(idx)
first_nan = total_nan_len//2
# fill the first 50% with 1
df.loc[idx[0:first_nan], 'Column_x'] = 1
# fill the last 50% with 0
df.loc[idx[first_nan:total_nan_len], 'Column_x'] = 0
print(df)
Output:
Before Dataframe
Column_y Column_x
a 0 0.0
b 1 NaN
c 2 NaN
d 3 NaN
e 4 NaN
f 5 NaN
g 6 NaN
h 7 NaN
i 8 NaN
After Dataframe
Column_y Column_x
a 0 0.0
b 1 1.0
c 2 1.0
d 3 1.0
e 4 1.0
f 5 0.0
g 6 0.0
h 7 0.0
i 8 0.0

How to reset cumulative sum every time there is a NaN in a pandas dataframe?

If I have a Pandas data frame like this:
1 2 3 4 5 6 7
1 NaN 1 1 1 NaN 1 1
2 NaN NaN 1 1 1 1 1
3 NaN NaN NaN 1 NaN 1 1
4 1 1 NaN NaN 1 1 NaN
How do I do a cumulative sum such that the count resets every time there is a NaN value in the row? Such that I get something like this:
1 2 3 4 5 6 7
1 NaN 1 2 3 NaN 1 2
2 NaN NaN 1 2 3 4 5
3 NaN NaN NaN 1 NaN 1 2
4 1 2 NaN NaN 1 2 NaN
You could do:
# compute a mask that is True where the value is NaN
mask = pd.isna(df).astype(bool)
# compute the cumsum across rows, forward-filling NaNs
cumulative = df.cumsum(1).fillna(method='ffill', axis=1).fillna(0)
# get the values of cumulative where the mask is True, filled the same way
restart = cumulative[mask].fillna(method='ffill', axis=1).fillna(0)
# set the result
result = (cumulative - restart)
result[mask] = np.nan
# display the result
print(result)
Output
1 2 3 4 5 6 7
0 NaN 1.0 2.0 3.0 NaN 1.0 2.0
1 NaN NaN 1.0 2.0 3.0 4.0 5.0
2 NaN NaN NaN 1.0 NaN 1.0 2.0
3 1.0 2.0 NaN NaN 1.0 2.0 NaN
You can do it with stack and unstack:
s = df.stack(dropna=False).isnull().cumsum()
df = df.where(df.isnull(), s.groupby(s).cumcount().unstack())
df
Out[86]:
1 2 3 4 5 6 7
1 NaN 1.0 2.0 3.0 NaN 1 2.0
2 NaN NaN 1.0 2.0 3.0 4 5.0
3 NaN NaN NaN 1.0 NaN 1 2.0
4 3.0 4.0 NaN NaN 1.0 2 NaN
I came up with a slightly different answer here that might be helpful.
For a single series, I made this function to do the cumsum-reset on nulls.
def cumsum_reset_on_null(srs: pd.Series) -> pd.Series:
    """
    For a pandas series with null values,
    do a cumsum and reset the cumulative sum when a null value is encountered.
    Example)
    input: [1, 1, np.nan, 1, 2, 3, np.nan, 1, np.nan]
    return: [1, 2, 0, 1, 3, 6, 0, 1, 0]
    """
    cumulative = srs.cumsum().fillna(method='ffill')
    restart = ((cumulative * srs.isnull()).replace(0.0, np.nan)
               .fillna(method='ffill').fillna(0))
    result = (cumulative - restart)
    return result.replace(0, np.nan)
Then for the full dataframe, just apply this function row-wise:
df = pd.DataFrame([
    [np.nan, 1, 1, 1, np.nan, 1, 1],
    [np.nan, np.nan, 1, 1, 1, 1, 1],
    [np.nan, np.nan, np.nan, 1, np.nan, 1, 1],
    [1, 1, np.nan, np.nan, 1, 1, np.nan],
])
df.apply(cumsum_reset_on_null, axis=1)
0 NaN 1.0 2.0 3.0 NaN 1.0 2.0
1 NaN NaN 1.0 2.0 3.0 4.0 5.0
2 NaN NaN NaN 1.0 NaN 1.0 2.0
3 1.0 2.0 NaN NaN 1.0 2.0 NaN
One way can be:
sample = pd.DataFrame({1: [np.nan, np.nan, np.nan, 1], 2: [1, np.nan, np.nan, 1],
                       3: [1, 1, np.nan, np.nan], 4: [1, 1, 1, np.nan],
                       5: [np.nan, 1, np.nan, 1], 6: [1, 1, 1, 1],
                       7: [1, 1, 1, np.nan]}, index=[1, 2, 3, 4])
Output of sample
1 2 3 4 5 6 7
1 NaN 1.0 1.0 1.0 NaN 1 1.0
2 NaN NaN 1.0 1.0 1.0 1 1.0
3 NaN NaN NaN 1.0 NaN 1 1.0
4 1.0 1.0 NaN NaN 1.0 1 NaN
The following code would do it:
# numr = number of rows, numc = number of columns
numr, numc = sample.shape
for i in range(numr):
    s = 0
    flag = 0
    for j in range(numc):
        if np.isnan(sample.iloc[i, j]):
            flag = 1
        else:
            if flag == 1:
                s = sample.iloc[i, j]
                flag = 0
            else:
                s += sample.iloc[i, j]
            sample.iloc[i, j] = s
Output:
1 2 3 4 5 6 7
1 NaN 1.0 2.0 3.0 NaN 1.0 2.0
2 NaN NaN 1.0 2.0 3.0 4.0 5.0
3 NaN NaN NaN 1.0 NaN 1.0 2.0
4 1.0 2.0 NaN NaN 1.0 2.0 NaN
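A vectorized alternative (my own sketch, not from the answers above): group each row by the running count of NaNs, then cumsum within each group; since cumsum skips NaNs, the NaN cells stay NaN.

import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 1, 1, 1, np.nan, 1, 1],
                   [np.nan, np.nan, 1, 1, 1, 1, 1]])

def cumsum_reset(row: pd.Series) -> pd.Series:
    groups = row.isna().cumsum()  # a new group starts at every NaN
    return row.groupby(groups).cumsum()  # NaN cells remain NaN

print(df.apply(cumsum_reset, axis=1))
# 0  NaN  1.0  2.0  3.0  NaN  1.0  2.0
# 1  NaN  NaN  1.0  2.0  3.0  4.0  5.0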

How to fill nan values with rolling mean in pandas

I have a dataframe which contains nan values in a few places. I am trying to perform data cleaning in which I fill the nan values with the mean of their previous five instances. To do so, I have come up with the following.
input_data_frame[var_list].fillna(input_data_frame[var_list].rolling(5).mean(), inplace=True)
But this is not working; it isn't filling the nan values. There is no change in the dataframe's null count before and after the above operation. Assuming I have a dataframe with just an integer column, how can I fill NaN values with the mean of the previous five instances? Thanks in advance.
This should work:
input_data_frame[var_list]= input_data_frame[var_list].fillna(pd.rolling_mean(input_data_frame[var_list], 6, min_periods=1))
Note that the window is 6 because it includes the NaN value itself (which is not counted in the average). The other NaN values are also not used for the averages, so if fewer than 5 values are found in the window, the average is calculated on the actual values.
Example:
df = {'a': [1, 1, 2, 3, 4, 5, np.nan, 1, 1, 2, 3, 4, 5, np.nan]}
df = pd.DataFrame(data=df)
print(df)
a
0 1.0
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 NaN
7 1.0
8 1.0
9 2.0
10 3.0
11 4.0
12 5.0
13 NaN
Output:
a
0 1.0
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 3.0
7 1.0
8 1.0
9 2.0
10 3.0
11 4.0
12 5.0
13 3.0
The rolling_mean function has been replaced by the .rolling() method in newer versions of pandas. If you are filling the entire dataset, you can use:
filled_dataset = dataset.fillna(dataset.rolling(6,min_periods=1).mean())
You can simply use interpolate():
df = {'a': [1, 5, np.nan, np.nan, np.nan, 2, 5, np.nan]}
df = pd.DataFrame(data=df)
print(df)
df['a'].interpolate()
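For the sample above, the default linear interpolation fills the interior gap evenly and pads the trailing NaN with the last valid value (my quick check, not from the original answer; note this is interpolation, not a rolling mean):

df['a'] = df['a'].interpolate()  # assign the result back; interpolate() does not modify in place
# a: [1.0, 5.0, 4.25, 3.5, 2.75, 2.0, 5.0, 5.0]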

Coalesce values from 2 columns into a single column in a pandas dataframe

I'm looking for a method that behaves similarly to coalesce in T-SQL. I have 2 columns (column A and B) that are sparsely populated in a pandas dataframe. I'd like to create a new column using the following rules:
If the value in column A is not null, use that value for the new column C
If the value in column A is null, use the value in column B for the new column C
Like I mentioned, this can be accomplished in MS SQL Server via the coalesce function. I haven't found a good pythonic method for this; does one exist?
Use combine_first():
In [16]: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('ab'))
In [17]: df.loc[::2, 'a'] = np.nan
In [18]: df
Out[18]:
a b
0 NaN 0
1 5.0 5
2 NaN 8
3 2.0 8
4 NaN 3
5 9.0 4
6 NaN 7
7 2.0 0
8 NaN 6
9 2.0 5
In [19]: df['c'] = df.a.combine_first(df.b)
In [20]: df
Out[20]:
a b c
0 NaN 0 0.0
1 5.0 5 5.0
2 NaN 8 8.0
3 2.0 8 2.0
4 NaN 3 3.0
5 9.0 4 9.0
6 NaN 7 7.0
7 2.0 0 2.0
8 NaN 6 6.0
9 2.0 5 2.0
Coalesce for multiple columns with DataFrame.bfill
All these methods work for two columns and are fine with maybe three, but they all require method chaining once you have n > 2 columns:
example dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'col1': [np.NaN, 2, 4, 5, np.NaN],
                   'col2': [np.NaN, 5, 1, 0, np.NaN],
                   'col3': [2, np.NaN, 9, 1, np.NaN],
                   'col4': [np.NaN, 10, 11, 4, 8]})
print(df)
col1 col2 col3 col4
0 NaN NaN 2.0 NaN
1 2.0 5.0 NaN 10.0
2 4.0 1.0 9.0 11.0
3 5.0 0.0 1.0 4.0
4 NaN NaN NaN 8.0
Using DataFrame.bfill over the columns axis (axis=1), we can get the coalesced values in a generalized way, even for a large number of columns.
Plus, this also works for string-type columns!
df['coalesce'] = df.bfill(axis=1).iloc[:, 0]
col1 col2 col3 col4 coalesce
0 NaN NaN 2.0 NaN 2.0
1 2.0 5.0 NaN 10.0 2.0
2 4.0 1.0 9.0 11.0 4.0
3 5.0 0.0 1.0 4.0 5.0
4 NaN NaN NaN 8.0 8.0
Using Series.combine_first (the accepted answer) can get quite cumbersome and eventually becomes unworkable as the number of columns grows:
df['coalesce'] = (
    df['col1'].combine_first(df['col2'])
              .combine_first(df['col3'])
              .combine_first(df['col4'])
)
col1 col2 col3 col4 coalesce
0 NaN NaN 2.0 NaN 2.0
1 2.0 5.0 NaN 10.0 2.0
2 4.0 1.0 9.0 11.0 4.0
3 5.0 0.0 1.0 4.0 5.0
4 NaN NaN NaN 8.0 8.0
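If you prefer combine_first for many columns, a functools.reduce sketch (my own addition, not from the original answer) avoids writing the chain by hand:

from functools import reduce

cols = ['col1', 'col2', 'col3', 'col4']
df['coalesce'] = reduce(lambda acc, c: acc.combine_first(df[c]), cols[1:], df[cols[0]])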
Try this too; it's easier to remember:
df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )
This is slightly faster: df['c'] = np.where(df["a"].isnull(), df["b"], df["a"])
%timeit df['d'] = df.a.combine_first(df.b)
1000 loops, best of 3: 472 µs per loop
%timeit df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )
1000 loops, best of 3: 291 µs per loop
combine_first is the most straightforward option, but below I outline a few more solutions, some applicable to different cases.
Case #1: Non-mutually Exclusive NaNs
Not all rows have NaNs, and these NaNs are not mutually exclusive between columns.
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, np.nan, 5.0, 7.0, np.nan],
    'b': [5.0, 3.0, np.nan, 4.0, np.nan, 6.0, 7.0]})
df
a b
0 1.0 5.0
1 2.0 3.0
2 3.0 NaN
3 NaN 4.0
4 5.0 NaN
5 7.0 6.0
6 NaN 7.0
Let's combine first on a.
Series.mask
df['a'].mask(pd.isnull, df['b'])
# df['a'].mask(df['a'].isnull(), df['b'])
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 7.0
6 7.0
Name: a, dtype: float64
Series.where
df['a'].where(pd.notnull, df['b'])
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 7.0
6 7.0
Name: a, dtype: float64
You can use similar syntax using np.where.
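A sketch of that np.where form (my construction, writing the result to a hypothetical column 'c'):

import numpy as np
df['c'] = np.where(df['a'].notna(), df['a'], df['b'])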
Alternatively, to combine first on b, switch the conditions around.
Case #2: Mutually Exclusive Positioned NaNs
All rows have NaNs which are mutually exclusive between columns.
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, np.nan, 5.0, np.nan, np.nan],
    'b': [np.nan, np.nan, np.nan, 4.0, np.nan, 6.0, 7.0]})
df
a b
0 1.0 NaN
1 2.0 NaN
2 3.0 NaN
3 NaN 4.0
4 5.0 NaN
5 NaN 6.0
6 NaN 7.0
Series.update
This method works in-place, modifying the original DataFrame. This is an efficient option for this use case.
df['b'].update(df['a'])
# Or, to update "a" in-place,
# df['a'].update(df['b'])
df
a b
0 1.0 1.0
1 2.0 2.0
2 3.0 3.0
3 NaN 4.0
4 5.0 5.0
5 NaN 6.0
6 NaN 7.0
Series.add
df['a'].add(df['b'], fill_value=0)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
dtype: float64
DataFrame.fillna + DataFrame.sum
df.fillna(0).sum(1)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
dtype: float64
I encountered this problem too, but wanted to coalesce multiple columns, picking the first non-null value from several columns. I found the following helpful:
Build dummy data
import pandas as pd
df = pd.DataFrame({'a1': [None, 2, 3, None],
                   'a2': [2, None, 4, None],
                   'a3': [4, 5, None, None],
                   'a4': [None, None, None, None],
                   'b1': [9, 9, 9, 999]})
df
a1 a2 a3 a4 b1
0 NaN 2.0 4.0 None 9
1 2.0 NaN 5.0 None 9
2 3.0 4.0 NaN None 9
3 NaN NaN NaN None 999
Coalesce a1, a2, a3 into a new column A:
def get_first_non_null(dfrow, columns_to_search):
    for c in columns_to_search:
        if pd.notnull(dfrow[c]):
            return dfrow[c]
    return None
# sample usage:
cols_to_search = ['a1', 'a2', 'a3']
df['A'] = df.apply(lambda x: get_first_non_null(x, cols_to_search), axis=1)
print(df)
a1 a2 a3 a4 b1 A
0 NaN 2.0 4.0 None 9 2.0
1 2.0 NaN 5.0 None 9 2.0
2 3.0 4.0 NaN None 9 3.0
3 NaN NaN NaN None 999 NaN
I'm thinking of a solution like this:
from typing import List

def coalesce(s: pd.Series, *series: List[pd.Series]):
    """coalesce the column information like a SQL coalesce."""
    for other in series:
        s = s.mask(pd.isnull, other)
    return s
because, given a DataFrame with columns ['a', 'b', 'c'], you can use it like a SQL coalesce:
df['d'] = coalesce(df.a, df.b, df.c)
For a more general case, where there are no NaNs but you want the same behavior:
Merge 'left', but override 'right' values where possible
Good code, but you have a typo for Python 3; the correct version looks like this:
def coalesce(s: pd.Series, *series: List[pd.Series]):
    """coalesce the column information like a SQL coalesce."""
    for other in series:
        s = s.mask(pd.isnull, other)
    return s
Consider using DuckDB for efficient SQL on Pandas. It's performant, simple, and feature-packed. https://duckdb.org/2021/05/14/sql-on-pandas.html
Sample Dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, np.NaN, 3, 4, 5],
                   'B': [np.NaN, 2, 3, 4, np.NaN]})
Coalesce using DuckDB:
import duckdb
out_df = duckdb.query("""SELECT A,B,coalesce(A,B) as C from df""").to_df()
print(out_df)
Output:
A B C
0 1.0 NaN 1.0
1 NaN 2.0 2.0
2 3.0 3.0 3.0
3 4.0 4.0 4.0
4 5.0 NaN 5.0
