How to fill NaN values with a rolling mean in pandas

I have a dataframe that contains NaN values in a few places. I am trying to clean the data by filling each NaN value with the mean of its previous five instances. To do so, I have come up with the following.
input_data_frame[var_list].fillna(input_data_frame[var_list].rolling(5).mean(), inplace=True)
But this is not working: it isn't filling the NaN values. The dataframe's null count is the same before and after the operation. Assuming I have a dataframe with just an integer column, how can I fill NaN values with the mean of the previous five instances? Thanks in advance.

This should work:
input_data_frame[var_list]= input_data_frame[var_list].fillna(pd.rolling_mean(input_data_frame[var_list], 6, min_periods=1))
Note that the window is 6 because it includes the NaN value itself (which is not counted in the average). Other NaN values are also excluded from the averages, so if fewer than 5 values are found in the window, the average is calculated on the values that are present.
Example:
import numpy as np
import pandas as pd
df = {'a': [1, 1, 2, 3, 4, 5, np.nan, 1, 1, 2, 3, 4, 5, np.nan]}
df = pd.DataFrame(data=df)
print(df)
a
0 1.0
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 NaN
7 1.0
8 1.0
9 2.0
10 3.0
11 4.0
12 5.0
13 NaN
Output:
a
0 1.0
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 3.0
7 1.0
8 1.0
9 2.0
10 3.0
11 4.0
12 5.0
13 3.0

pd.rolling_mean was deprecated and later removed from pandas. To fill the entire dataset, you can use:
filled_dataset = dataset.fillna(dataset.rolling(6,min_periods=1).mean())
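As a minimal sketch of why the original attempt likely left the NaNs in place, and of how the windowed fill behaves: selecting columns with a list (`input_data_frame[var_list]`) returns a copy, so `inplace=True` modifies that copy, and `rolling(5).mean()` is itself NaN wherever the window contains a NaN. The frame `df` and its data below are illustrative. The `shift(1)` variant is an assumption about what "previous five instances" should mean, not something the answers above propose.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 3, 4, 5, np.nan, 1, 1, 2, 3, 4, 5, np.nan]})

# rolling(5) uses min_periods=5 by default, so any window that contains
# a NaN yields NaN, and fillna has nothing to fill the gap with.
strict = df["a"].rolling(5).mean()

# A window of 6 with min_periods=1 covers the NaN row plus the five rows
# before it; NaN values are skipped when the mean is computed.
filled = df.fillna(df.rolling(6, min_periods=1).mean())

# Alternative: shift by one row so the window holds only previous rows.
prev5 = df["a"].shift(1).rolling(5, min_periods=1).mean()
filled_alt = df["a"].fillna(prev5)

print(strict.iloc[[6, 13]].tolist())
print(filled["a"].iloc[[6, 13]].tolist())
print(filled_alt.iloc[[6, 13]].tolist())
```

Assigning the result back (rather than relying on `inplace=True` on a column selection) is what makes the fill stick.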

You can simply use interpolate():
df = {'a': [1,5, np.nan, np.nan, np.nan, 2, 5, np.nan] }
df = pd.DataFrame(data=df)
print(df)
df['a'].interpolate()
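Note that interpolate() does not compute a rolling mean. With its default settings it interpolates linearly between the valid values on either side of a gap, so it answers a slightly different question. A small sketch on the same data (the variable names `s` and `result` are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 5, np.nan, np.nan, np.nan, 2, 5, np.nan], name="a")

# Default method is linear: interior gaps are spaced evenly between the
# surrounding valid values; trailing NaNs repeat the last valid value.
result = s.interpolate()
print(result.tolist())
```

Here the three interior gaps between 5 and 2 become 4.25, 3.5 and 2.75, which differs from what a mean of the previous five values would give.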

Related

Filling up empty columns but has requirements [duplicate]

I'm looking for a method that behaves similarly to coalesce in T-SQL. I have 2 columns (column A and B) that are sparsely populated in a pandas dataframe. I'd like to create a new column using the following rules:
If the value in column A is not null, use that value for the new column C
If the value in column A is null, use the value in column B for the new column C
Like I mentioned, this can be accomplished in MS SQL Server via the coalesce function. I haven't found a good pythonic method for this; does one exist?
Use combine_first():
In [16]: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('ab'))
In [17]: df.loc[::2, 'a'] = np.nan
In [18]: df
Out[18]:
a b
0 NaN 0
1 5.0 5
2 NaN 8
3 2.0 8
4 NaN 3
5 9.0 4
6 NaN 7
7 2.0 0
8 NaN 6
9 2.0 5
In [19]: df['c'] = df.a.combine_first(df.b)
In [20]: df
Out[20]:
a b c
0 NaN 0 0.0
1 5.0 5 5.0
2 NaN 8 8.0
3 2.0 8 2.0
4 NaN 3 3.0
5 9.0 4 9.0
6 NaN 7 7.0
7 2.0 0 2.0
8 NaN 6 6.0
9 2.0 5 2.0
Coalesce for multiple columns with DataFrame.bfill
All these methods work for two columns and are fine with maybe three, but they all require method chaining once you have n > 2 columns.
Example dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'col1': [np.nan, 2, 4, 5, np.nan],
                   'col2': [np.nan, 5, 1, 0, np.nan],
                   'col3': [2, np.nan, 9, 1, np.nan],
                   'col4': [np.nan, 10, 11, 4, 8]})
print(df)
col1 col2 col3 col4
0 NaN NaN 2.0 NaN
1 2.0 5.0 NaN 10.0
2 4.0 1.0 9.0 11.0
3 5.0 0.0 1.0 4.0
4 NaN NaN NaN 8.0
Using DataFrame.bfill over the columns axis (axis=1), we can get the values in a generalized way, even for a large number of columns.
This also works for string-type columns.
df['coalesce'] = df.bfill(axis=1).iloc[:, 0]
col1 col2 col3 col4 coalesce
0 NaN NaN 2.0 NaN 2.0
1 2.0 5.0 NaN 10.0 2.0
2 4.0 1.0 9.0 11.0 4.0
3 5.0 0.0 1.0 4.0 5.0
4 NaN NaN NaN 8.0 8.0
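To illustrate the claim about string columns, here is a small sketch with made-up data (the column names `first`, `second`, `third` are hypothetical):

```python
import pandas as pd

# Hypothetical string data; None marks a missing value.
df = pd.DataFrame({
    "first": [None, "a", None],
    "second": ["x", "b", None],
    "third": ["y", None, "z"],
})

# Back-fill along each row, then take the leftmost column: it now holds
# the first non-null value found when reading the row left to right.
df["coalesce"] = df.bfill(axis=1).iloc[:, 0]
print(df["coalesce"].tolist())
```

Column order determines priority, so reorder or subset the columns before calling bfill if a different precedence is needed.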
Using Series.combine_first (the accepted answer) gets quite cumbersome and eventually becomes unworkable as the number of columns grows:
df['coalesce'] = (
    df['col1'].combine_first(df['col2'])
    .combine_first(df['col3'])
    .combine_first(df['col4'])
)
col1 col2 col3 col4 coalesce
0 NaN NaN 2.0 NaN 2.0
1 2.0 5.0 NaN 10.0 2.0
2 4.0 1.0 9.0 11.0 4.0
3 5.0 0.0 1.0 4.0 5.0
4 NaN NaN NaN 8.0 8.0
Try this also; it is easier to remember:
df['c'] = np.where(df["a"].isnull(), df["b"], df["a"])
This is slightly faster:
df['c'] = np.where(df["a"].isnull() == True, df["b"], df["a"])
%timeit df['d'] = df.a.combine_first(df.b)
1000 loops, best of 3: 472 µs per loop
%timeit df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )
1000 loops, best of 3: 291 µs per loop
combine_first is the most straightforward option. I'm going to outline a few more solutions below, some applicable to different cases.
Case #1: Non-mutually Exclusive NaNs
Not all rows have NaNs, and these NaNs are not mutually exclusive between columns.
df = pd.DataFrame({
'a': [1.0, 2.0, 3.0, np.nan, 5.0, 7.0, np.nan],
'b': [5.0, 3.0, np.nan, 4.0, np.nan, 6.0, 7.0]})
df
a b
0 1.0 5.0
1 2.0 3.0
2 3.0 NaN
3 NaN 4.0
4 5.0 NaN
5 7.0 6.0
6 NaN 7.0
Let's combine first on a.
Series.mask
df['a'].mask(pd.isnull, df['b'])
# df['a'].mask(df['a'].isnull(), df['b'])
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 7.0
6 7.0
Name: a, dtype: float64
Series.where
df['a'].where(pd.notnull, df['b'])
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 7.0
6 7.0
Name: a, dtype: float64
You can use similar syntax using np.where.
Alternatively, to combine first on b, switch the conditions around.
Case #2: Mutually Exclusive Positioned NaNs
All rows have NaNs which are mutually exclusive between columns.
df = pd.DataFrame({
'a': [1.0, 2.0, 3.0, np.nan, 5.0, np.nan, np.nan],
'b': [np.nan, np.nan, np.nan, 4.0, np.nan, 6.0, 7.0]})
df
a b
0 1.0 NaN
1 2.0 NaN
2 3.0 NaN
3 NaN 4.0
4 5.0 NaN
5 NaN 6.0
6 NaN 7.0
Series.update
This method works in-place, modifying the original DataFrame. This is an efficient option for this use case.
df['b'].update(df['a'])
# Or, to update "a" in-place,
# df['a'].update(df['b'])
df
a b
0 1.0 1.0
1 2.0 2.0
2 3.0 3.0
3 NaN 4.0
4 5.0 5.0
5 NaN 6.0
6 NaN 7.0
Series.add
df['a'].add(df['b'], fill_value=0)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
dtype: float64
DataFrame.fillna + DataFrame.sum
df.fillna(0).sum(1)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
dtype: float64
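The two arithmetic approaches above depend on at most one column being populated in each row. A minimal sketch, using made-up data where that assumption does not hold, shows how they diverge from a true coalesce:

```python
import numpy as np
import pandas as pd

# Row 2 has values in both columns; row 3 has none.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan],
                   "b": [np.nan, 2.0, 4.0, np.nan]})

coalesced = df["a"].combine_first(df["b"])   # first non-null value per row
added = df["a"].add(df["b"], fill_value=0)   # adds the values when both exist
summed = df.fillna(0).sum(axis=1)            # also turns all-NaN rows into 0

print(pd.DataFrame({"coalesced": coalesced, "added": added, "summed": summed}))
```

Where both columns hold a value the sum replaces the coalesced value, and fillna(0).sum reports 0 for a row that is entirely missing, so these shortcuts are only safe for the mutually exclusive case.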
I encountered this problem but wanted to coalesce multiple columns, picking the first non-null value from several columns. I found the following helpful:
Build dummy data
import pandas as pd
df = pd.DataFrame({'a1': [None, 2, 3, None],
'a2': [2, None, 4, None],
'a3': [4, 5, None, None],
'a4': [None, None, None, None],
'b1': [9, 9, 9, 999]})
df
a1 a2 a3 a4 b1
0 NaN 2.0 4.0 None 9
1 2.0 NaN 5.0 None 9
2 3.0 4.0 NaN None 9
3 NaN NaN NaN None 999
Coalesce a1, a2, a3 into a new column A
def get_first_non_null(dfrow, columns_to_search):
    for c in columns_to_search:
        if pd.notnull(dfrow[c]):
            return dfrow[c]
    return None
# sample usage:
cols_to_search = ['a1', 'a2', 'a3']
df['A'] = df.apply(lambda x: get_first_non_null(x, cols_to_search), axis=1)
print(df)
a1 a2 a3 a4 b1 A
0 NaN 2.0 4.0 None 9 2.0
1 2.0 NaN 5.0 None 9 2.0
2 3.0 4.0 NaN None 9 3.0
3 NaN NaN NaN None 999 NaN
I'm thinking of a solution like this:
def coalesce(s: pd.Series, *series: pd.Series):
    """coalesce the column information like a SQL coalesce."""
    for other in series:
        s = s.mask(pd.isnull, other)
    return s
because given a DataFrame with columns ['a', 'b', 'c'], you can use it like a SQL coalesce:
df['d'] = coalesce(df.a, df.b, df.c)
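A self-contained usage sketch of this helper follows. The data is made up for illustration, and the type hints are one reasonable choice rather than the answerer's exact signature:

```python
import numpy as np
import pandas as pd

def coalesce(s: pd.Series, *series: pd.Series) -> pd.Series:
    """Return the first non-null value per row, like SQL COALESCE."""
    for other in series:
        # Wherever s is still null, take the value from the next candidate.
        s = s.mask(pd.isnull, other)
    return s

df = pd.DataFrame({"a": [1.0, np.nan, np.nan, np.nan],
                   "b": [9.0, 2.0, np.nan, np.nan],
                   "c": [8.0, 7.0, 3.0, np.nan]})
df["d"] = coalesce(df["a"], df["b"], df["c"])
print(df)
```

Rows where every candidate is null stay null, which matches the SQL behavior.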
For a more general case, where there are no NaNs but you want the same behavior:
Merge 'left', but override 'right' values where possible
Good code, but you have a typo for Python 3; the correct one looks like this:
"""coalesce the column information like a SQL coalesce."""
for other in series:
s = s.mask(pd.isnull, other)
return s
Consider using DuckDB for efficient SQL on Pandas. It's performant, simple, and feature-packed. https://duckdb.org/2021/05/14/sql-on-pandas.html
Sample Dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, np.nan, 3, 4, 5],
                   'B': [np.nan, 2, 3, 4, np.nan]})
Coalesce using DuckDB:
import duckdb
out_df = duckdb.query("""SELECT A,B,coalesce(A,B) as C from df""").to_df()
print(out_df)
Output:
A B c
0 1.0 NaN 1.0
1 NaN 2.0 2.0
2 3.0 3.0 3.0
3 4.0 4.0 4.0
4 5.0 NaN 5.0

find first non NaN value in shift pandas

I have the following issue. I would like to compute the lag of a column in my df. However, I have a condition that the lagged value cannot be NaN.
See the example below:
import numpy as np
import pandas as pd
d = {'col1': [1, 2, 10, 5, 3, 2], 'col2': [3, 4, np.nan, np.nan, 23, 42]}
df = pd.DataFrame(data=d)
when I try this:
df["col2_lag"] = df["col2"].shift(1)
I got this result:
col1 col2 col2_lag
0 1 3.0 NaN
1 2 4.0 3.0
2 10 NaN 4.0
3 5 NaN NaN
4 3 23.0 NaN
5 2 42.0 23.0
However, desired output is this:
col1 col2 col2_lag
0 1 3.0 NaN
1 2 4.0 3.0
2 10 NaN 4.0
3 5 NaN 4.0 #because we skip NaN and find first non NaN
4 3 23.0 4.0 #because we skip NaN and find first non NaN
5 2 42.0 23.0
Is there an elegant way to do this? Ideally without writing my own function. Thanks
Use ffill:
df["col2_lag"] = df["col2"].shift(1).ffill()
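As a quick check, the sketch below runs this one-liner on the question's data and reproduces the desired column:

```python
import numpy as np
import pandas as pd

d = {"col1": [1, 2, 10, 5, 3, 2], "col2": [3, 4, np.nan, np.nan, 23, 42]}
df = pd.DataFrame(data=d)

# shift(1) produces the plain lag; ffill() then replaces each NaN with the
# most recent non-NaN lagged value. The first row stays NaN because there
# is nothing before it to carry forward.
df["col2_lag"] = df["col2"].shift(1).ffill()
print(df)
```

The order matters: shifting first and filling second keeps the lag from ever looking at the current row's own value.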

Pandas check if two columns are the same [duplicate]

df = {'A':[3, 4, 5, np.nan, 6, 7],
'B':[np.nan, 4, np.nan, np.nan, 6, 7]}
I have a data frame with two columns, A and B. I want to create a new column, C, which is the result of checking whether A and B are the same: if they are, keep the value, but if one is NaN, keep the other value. Columns A and B always hold either a value or NaN. When both hold a value, the values in A and B are always the same.
I know how to check whether A and B are the same:
df['C'] = (df['A'] == df['B']).astype('object')
But this gives a boolean answer in column C whether it's true or false. My expected output would be:
A B C
3 NaN 3
4 4 4
5 NaN 5
NaN NaN NaN
6 6 6
7 7 7
You can use np.where()
>>> df = pd.DataFrame({'A':[3, 4, 5, np.nan],'B':[np.nan,4,np.nan,np.nan]})
>>> df
A B
0 3.0 NaN
1 4.0 4.0
2 5.0 NaN
3 NaN NaN
>>> df['C'] = np.where(df['A'].isna(), df['B'], df['A'])
>>> df
A B C
0 3.0 NaN 3.0
1 4.0 4.0 4.0
2 5.0 NaN 5.0
3 NaN NaN NaN
Edited Sample
Showing that it would work if df['A'] is nan and df['B'] has value.
>>> df = pd.DataFrame({'A':[3, np.nan, 5, np.nan],'B':[np.nan,4,np.nan,np.nan]})
>>> df
A B
0 3.0 NaN
1 NaN 4.0
2 5.0 NaN
3 NaN NaN
>>> df['C'] = np.where(df['A'].isna(), df['B'], df['A'])
>>> df
A B C
0 3.0 NaN 3.0
1 NaN 4.0 4.0
2 5.0 NaN 5.0
3 NaN NaN NaN
Thanks :D
Use np.select where you can check multiple conditions.
df = pd.DataFrame({'A':[3, 4, 5, np.nan, 6, np.nan],
'B':[np.nan, 4, np.nan, np.nan, 6, 7]})
df['c'] = np.select([df['A'].isnull() & df['B'].isnull(), df['A'].isnull()],
[np.nan, df['B']], df['A'])
Output:
A B c
0 3.0 NaN 3.0
1 4.0 4.0 4.0
2 5.0 NaN 5.0
3 NaN NaN NaN
4 6.0 6.0 6.0
5 NaN 7.0 7.0
If it's guaranteed that A & B are identical values when not nans, then it looks like you could use .combine_first here:
df['C'] = df.A.combine_first(df.B)
I think fillna is sufficient for your requirement:
df['C'] = df.A.fillna(df.B)
Out[92]:
A B C
0 3.0 NaN 3.0
1 4.0 4.0 4.0
2 5.0 NaN 5.0
3 NaN NaN NaN
4 6.0 6.0 6.0
5 7.0 7.0 7.0
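For this data the fillna, combine_first and np.where answers agree. A minimal sketch that checks this on the question's frame (the `via_*` names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [3, 4, 5, np.nan, 6, 7],
                   "B": [np.nan, 4, np.nan, np.nan, 6, 7]})

via_fillna = df["A"].fillna(df["B"])
via_combine = df["A"].combine_first(df["B"])
# np.where returns a plain array, so wrap it to compare like with like.
via_where = pd.Series(np.where(df["A"].isna(), df["B"], df["A"]),
                      index=df.index, name="A")

# Series.equals treats NaN values in the same position as equal.
print(via_fillna.equals(via_combine), via_fillna.equals(via_where))
```

The row where both columns are NaN stays NaN in all three results.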
