Filling up empty columns but has requirements [duplicate] - python

I'm looking for a method that behaves similarly to coalesce in T-SQL. I have 2 columns (column A and B) that are sparsely populated in a pandas dataframe. I'd like to create a new column using the following rules:
If the value in column A is not null, use that value for the new column C
If the value in column A is null, use the value in column B for the new column C
Like I mentioned, this can be accomplished in MS SQL Server via the coalesce function. I haven't found a good pythonic method for this; does one exist?

Use combine_first():
In [16]: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('ab'))
In [17]: df.loc[::2, 'a'] = np.nan
In [18]: df
Out[18]:
a b
0 NaN 0
1 5.0 5
2 NaN 8
3 2.0 8
4 NaN 3
5 9.0 4
6 NaN 7
7 2.0 0
8 NaN 6
9 2.0 5
In [19]: df['c'] = df.a.combine_first(df.b)
In [20]: df
Out[20]:
a b c
0 NaN 0 0.0
1 5.0 5 5.0
2 NaN 8 8.0
3 2.0 8 2.0
4 NaN 3 3.0
5 9.0 4 9.0
6 NaN 7 7.0
7 2.0 0 2.0
8 NaN 6 6.0
9 2.0 5 2.0

Coalesce for multiple columns with DataFrame.bfill
All these methods work for two columns and are fine with maybe three columns, but they all require method chaining if you have n columns when n > 2:
example dataframe:
import numpy as np
import pandas as pd
# np.nan is used instead of np.NaN, which was removed in NumPy 2.0
df = pd.DataFrame({'col1': [np.nan, 2, 4, 5, np.nan],
                   'col2': [np.nan, 5, 1, 0, np.nan],
                   'col3': [2, np.nan, 9, 1, np.nan],
                   'col4': [np.nan, 10, 11, 4, 8]})
print(df)
col1 col2 col3 col4
0 NaN NaN 2.0 NaN
1 2.0 5.0 NaN 10.0
2 4.0 1.0 9.0 11.0
3 5.0 0.0 1.0 4.0
4 NaN NaN NaN 8.0
Using DataFrame.bfill along the columns axis (axis=1), we can pick the first non-null value in each row in a way that generalizes to any number of columns.
This also works for string columns.
df['coalesce'] = df.bfill(axis=1).iloc[:, 0]
col1 col2 col3 col4 coalesce
0 NaN NaN 2.0 NaN 2.0
1 2.0 5.0 NaN 10.0 2.0
2 4.0 1.0 9.0 11.0 4.0
3 5.0 0.0 1.0 4.0 5.0
4 NaN NaN NaN 8.0 8.0
With Series.combine_first (the accepted answer), this gets cumbersome and becomes impractical as the number of columns grows:
df['coalesce'] = (
    df['col1'].combine_first(df['col2'])
              .combine_first(df['col3'])
              .combine_first(df['col4'])
)
col1 col2 col3 col4 coalesce
0 NaN NaN 2.0 NaN 2.0
1 2.0 5.0 NaN 10.0 2.0
2 4.0 1.0 9.0 11.0 4.0
3 5.0 0.0 1.0 4.0 5.0
4 NaN NaN NaN 8.0 8.0
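The bfill approach above can be wrapped in a small helper so that only a chosen subset of columns is scanned. This is a minimal sketch; the function name coalesce_columns and the sample frame are hypothetical, not part of the original answer.

```python
import numpy as np
import pandas as pd


def coalesce_columns(frame, columns):
    """Return the first non-null value per row, scanning `columns` left to right."""
    # Back-fill along each row, then keep the leftmost column of the result.
    return frame[columns].bfill(axis=1).iloc[:, 0]


df = pd.DataFrame({
    "col1": [np.nan, 2.0, np.nan],
    "col2": [np.nan, 5.0, np.nan],
    "col3": [7.0, np.nan, np.nan],
})
df["coalesce"] = coalesce_columns(df, ["col1", "col2", "col3"])
print(df)
```

A row in which every listed column is null stays null in the result, which matches SQL COALESCE returning NULL when all arguments are NULL.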

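You can also try np.where, which is easier to remember:
df['c'] = np.where(df["a"].isnull(), df["b"], df["a"])
It is also slightly faster than combine_first in the timing below. (Writing the condition as df["a"].isnull() == True is redundant, since isnull() already returns booleans.)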
%timeit df['d'] = df.a.combine_first(df.b)
1000 loops, best of 3: 472 µs per loop
%timeit df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )
1000 loops, best of 3: 291 µs per loop
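One difference worth keeping in mind when choosing between the two: combine_first aligns on index labels, while np.where pairs values by position. The sketch below uses two made-up Series whose labels are in a different order to show where the results diverge; in a single DataFrame, where both columns share one index, the two approaches agree.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0], index=[0, 1, 2])
b = pd.Series([10.0, 20.0, 30.0], index=[1, 2, 0])  # same labels, different order

# combine_first aligns on labels: the gap at label 1 takes b's value at label 1.
aligned = a.combine_first(b)

# np.where works on the underlying arrays, so it pairs values by position.
positional = pd.Series(np.where(a.isnull(), b, a), index=a.index)

print(aligned.tolist())
print(positional.tolist())
```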

combine_first is the most straightforward option. Below I outline a few more solutions, some of which apply only to particular cases.
Case #1: Non-mutually Exclusive NaNs
Not all rows have NaNs, and these NaNs are not mutually exclusive between columns.
df = pd.DataFrame({
'a': [1.0, 2.0, 3.0, np.nan, 5.0, 7.0, np.nan],
'b': [5.0, 3.0, np.nan, 4.0, np.nan, 6.0, 7.0]})
df
a b
0 1.0 5.0
1 2.0 3.0
2 3.0 NaN
3 NaN 4.0
4 5.0 NaN
5 7.0 6.0
6 NaN 7.0
Let's combine first on a.
Series.mask
df['a'].mask(pd.isnull, df['b'])
# df['a'].mask(df['a'].isnull(), df['b'])
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 7.0
6 7.0
Name: a, dtype: float64
Series.where
df['a'].where(pd.notnull, df['b'])
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 7.0
6 7.0
Name: a, dtype: float64
You can use similar syntax using np.where.
Alternatively, to combine first on b, switch the conditions around.
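As an illustration of switching the conditions around, here is a short sketch that prefers b and falls back to a, using the Case #1 data. The variable names b_first and b_first_np are only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, np.nan, 5.0, 7.0, np.nan],
    'b': [5.0, 3.0, np.nan, 4.0, np.nan, 6.0, 7.0]})

# Keep b where it is present, otherwise take the value from a.
b_first = df['b'].where(df['b'].notnull(), df['a'])

# The same selection with np.where, which returns a plain NumPy array.
b_first_np = np.where(df['b'].notnull(), df['b'], df['a'])

print(b_first.tolist())
```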
Case #2: Mutually Exclusive Positioned NaNs
All rows have NaNs which are mutually exclusive between columns.
df = pd.DataFrame({
'a': [1.0, 2.0, 3.0, np.nan, 5.0, np.nan, np.nan],
'b': [np.nan, np.nan, np.nan, 4.0, np.nan, 6.0, 7.0]})
df
a b
0 1.0 NaN
1 2.0 NaN
2 3.0 NaN
3 NaN 4.0
4 5.0 NaN
5 NaN 6.0
6 NaN 7.0
Series.update
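This method works in-place, modifying the original DataFrame, which makes it an efficient option for this use case. Note that with pandas Copy-on-Write enabled (opt-in in pandas 2.x), an update made through df['b'] is not expected to propagate back to df; in that case assign the combined Series back to the column instead.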
df['b'].update(df['a'])
# Or, to update "a" in-place,
# df['a'].update(df['b'])
df
a b
0 1.0 1.0
1 2.0 2.0
2 3.0 3.0
3 NaN 4.0
4 5.0 5.0
5 NaN 6.0
6 NaN 7.0
Series.add
df['a'].add(df['b'], fill_value=0)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
dtype: float64
DataFrame.fillna + DataFrame.sum
df.fillna(0).sum(1)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
dtype: float64
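The two arithmetic tricks above rely on the Case #2 assumption that at most one column has a value in each row. A small sketch with made-up data shows what happens when that assumption does not hold:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, np.nan],
                   "b": [2.0, 3.0, np.nan]})

# A true coalesce keeps a's value when both columns are present.
coalesced = df["a"].combine_first(df["b"])

# Addition sums the two values in row 0 instead of picking one.
added = df["a"].add(df["b"], fill_value=0)

# fillna(0) + sum additionally turns the all-null row into 0 rather than NaN.
summed = df.fillna(0).sum(axis=1)

print(pd.DataFrame({"coalesced": coalesced, "added": added, "summed": summed}))
```

If rows can have both values present, or can be entirely null, combine_first, mask, or where are the safer choices.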

I encountered this problem but wanted to coalesce multiple columns, picking the first non-null value from several columns. I found the following helpful:
Build dummy data
import pandas as pd
df = pd.DataFrame({'a1': [None, 2, 3, None],
'a2': [2, None, 4, None],
'a3': [4, 5, None, None],
'a4': [None, None, None, None],
'b1': [9, 9, 9, 999]})
df
a1 a2 a3 a4 b1
0 NaN 2.0 4.0 None 9
1 2.0 NaN 5.0 None 9
2 3.0 4.0 NaN None 9
3 NaN NaN NaN None 999
Coalesce a1, a2, a3 into a new column A
def get_first_non_null(dfrow, columns_to_search):
    # Return the first non-null value found in the given columns of this row
    for c in columns_to_search:
        if pd.notnull(dfrow[c]):
            return dfrow[c]
    return None
# sample usage:
cols_to_search = ['a1', 'a2', 'a3']
df['A'] = df.apply(lambda x: get_first_non_null(x, cols_to_search), axis=1)
print(df)
a1 a2 a3 a4 b1 A
0 NaN 2.0 4.0 None 9 2.0
1 2.0 NaN 5.0 None 9 2.0
2 3.0 4.0 NaN None 9 3.0
3 NaN NaN NaN None 999 NaN

I'm thinking of a solution like this:
def coalesce(s: pd.Series, *series: pd.Series) -> pd.Series:
    """coalesce the column information like a SQL coalesce."""
    for other in series:
        s = s.mask(pd.isnull, other)
    return s
because, given a DataFrame with columns ['a', 'b', 'c'], you can use it like a SQL coalesce:
df['d'] = coalesce(df.a, df.b, df.c)
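For reference, here is a self-contained sketch of such a variadic helper with sample data. The parameter names and the test frame are illustrative assumptions; the logic is the same mask-in-a-loop idea as above, written with an explicit boolean mask.

```python
import numpy as np
import pandas as pd


def coalesce(first: pd.Series, *others: pd.Series) -> pd.Series:
    """Return `first` with its nulls filled from `others`, in order."""
    result = first
    for other in others:
        # Replace only the positions that are still null.
        result = result.mask(result.isnull(), other)
    return result


df = pd.DataFrame({"a": [np.nan, 2.0, np.nan],
                   "b": [1.0, np.nan, np.nan],
                   "c": [9.0, 9.0, 9.0]})
df["d"] = coalesce(df["a"], df["b"], df["c"])
print(df)
```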

For a more general case, where there are no NaNs but you want the same behavior:
Merge 'left', but override 'right' values where possible

Good code, but you have a typo for Python 3; the correct one looks like this:
"""coalesce the column information like a SQL coalesce."""
for other in series:
    s = s.mask(pd.isnull, other)
return s

Consider using DuckDB for efficient SQL on Pandas. It's performant, simple, and feature-packed. https://duckdb.org/2021/05/14/sql-on-pandas.html
Sample Dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, np.nan, 3, 4, 5],
                   'B': [np.nan, 2, 3, 4, np.nan]})
Coalesce using DuckDB:
import duckdb
out_df = duckdb.query("""SELECT A,B,coalesce(A,B) as C from df""").to_df()
print(out_df)
Output:
A B C
0 1.0 NaN 1.0
1 NaN 2.0 2.0
2 3.0 3.0 3.0
3 4.0 4.0 4.0
4 5.0 NaN 5.0

Related

Rolling mean on a groupby with min_periods=1, but NOT ignoring NANs

So far, I've only seen questions about how to ignore NaNs while doing a rolling mean on a groupby. My case is the opposite: I want to include the NaNs, so that if even one of the values in the rolling window is NaN, the resulting rolling mean is NaN as well.
Input:
grouping value_to_avg
0 1 1.0
1 1 2.0
2 1 3.0
3 1 NaN
4 1 4.0
5 2 5.0
6 2 NaN
7 2 6.0
8 2 7.0
9 2 8.0
Code to create sample input:
data = {'grouping': [1,1,1,1,1,2,2,2,2,2], 'value_to_avg': [1,2,3,np.nan,4,5,np.nan,6,7,8]}
db = pd.DataFrame(data)
Code that I have tried:
db['rolling_mean_actual'] = db.groupby('grouping')['value_to_avg'].transform(lambda s: s.rolling(window=3, center=True, min_periods=1).mean(skipna=False))
Actual vs. expected output:
grouping value_to_avg rolling_mean_actual rolling_mean_expected
0 1 1.0 1.5 1.5
1 1 2.0 2.0 2.0
2 1 3.0 2.5 NaN
3 1 NaN 3.5 NaN
4 1 4.0 4.0 NaN
5 2 5.0 5.0 NaN
6 2 NaN 5.5 NaN
7 2 6.0 6.5 NaN
8 2 7.0 7.0 7.0
9 2 8.0 7.5 7.5
As you can see above, passing skipna=False to the mean function does not work as expected and still ignores NaNs.
For me, a custom function works: convert each window to a NumPy array and use np.mean, which returns NaN if any value in the window is NaN:
roll_window = 3
db['rolling_mean_actual'] = (db.groupby('grouping')['value_to_avg']
                               .transform(lambda s: s.rolling(roll_window,
                                                              center=True,
                                                              min_periods=1)
                                                     .apply(lambda x: np.mean(x.to_numpy()))))
You can also avoid transform:
roll_window = 3
db['rolling_mean_actual'] = (db.groupby('grouping')['value_to_avg']
                               .rolling(roll_window, center=True, min_periods=1)
                               .apply(lambda x: np.mean(x.to_numpy()))
                               .droplevel(0))
print (db)
grouping value_to_avg rolling_mean_actual
0 1 1.0 1.5
1 1 2.0 2.0
2 1 3.0 NaN
3 1 NaN NaN
4 1 4.0 NaN
5 2 5.0 NaN
6 2 NaN NaN
7 2 6.0 NaN
8 2 7.0 7.0
9 2 8.0 7.5
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"grouping": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
"value_to_avg": [1, 2, 3, np.nan, 4, 5, np.nan, 6, 7, 8],
}
)
pd.concat(
[
df,
df.groupby("grouping", as_index=False)
.rolling(window=3, center=True, min_periods=0)
.apply(lambda x: x.mean() if not x.isna().any() else np.nan)
.rename(columns={'value_to_avg': 'rolling avg'}),
],
axis=1,
).iloc[:, [0, 1, 3]]
>>>
grouping value_to_avg rolling avg
0 1 1.0 1.5
1 1 2.0 2.0
2 1 3.0 NaN
3 1 NaN NaN
4 1 4.0 NaN
5 2 5.0 NaN
6 2 NaN NaN
7 2 6.0 NaN
8 2 7.0 7.0
9 2 8.0 7.5
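The answers above call a Python function for every window. A possible vectorized alternative, sketched here under the same window settings (window=3, center=True, min_periods=1), is to compute the ordinary rolling mean and then blank out every window that contains a NaN. The helper name strict_rolling_mean is hypothetical.

```python
import numpy as np
import pandas as pd

db = pd.DataFrame({
    "grouping": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "value_to_avg": [1, 2, 3, np.nan, 4, 5, np.nan, 6, 7, 8],
})


def strict_rolling_mean(s, window=3):
    """Centered rolling mean that is NaN whenever the window contains a NaN."""
    roll_kwargs = dict(window=window, center=True, min_periods=1)
    mean = s.rolling(**roll_kwargs).mean()
    # Rolling max over a 0/1 NaN indicator flags windows holding at least one NaN.
    nan_in_window = s.isna().astype(float).rolling(**roll_kwargs).max() > 0
    return mean.mask(nan_in_window)


db["rolling_mean"] = (
    db.groupby("grouping")["value_to_avg"].transform(strict_rolling_mean)
)
print(db)
```

Because the indicator column itself has no missing values, its rolling max is defined at every position, including the shortened windows at the group edges.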

Append two dataframes, with some duplicate datetime.date index, choosing one dataframe over the other, using vectorization [duplicate]


Pandas check if two columns are the same [duplicate]

df = {'A':[3, 4, 5, np.nan, 6, 7],
'B':[np.nan, 4, np.nan, np.nan, 6, 7]}
I have a data frame with two columns, A and B. I want to create a new column, C, by checking whether A and B are the same: if they are, keep that value, but if one is NaN, keep the other value. Columns A and B always hold either a value or NaN. When both are present, the values in A and B are always the same.
I know how to check whether A and B are the same:
df['C'] = (df['A'] == df['B']).astype('object')
But this gives a boolean answer in column C whether it's true or false. My expected output would be:
A B C
3 NaN 3
4 4 4
5 NaN 5
NaN NaN NaN
6 6 6
7 7 7
You can use np.where()
>>> df = pd.DataFrame({'A':[3, 4, 5, np.nan],'B':[np.nan,4,np.nan,np.nan]})
>>> df
A B
0 3.0 NaN
1 4.0 4.0
2 5.0 NaN
3 NaN NaN
>>> df['C'] = np.where(df['A'].isna(), df['B'], df['A'])
>>> df
A B C
0 3.0 NaN 3.0
1 4.0 4.0 4.0
2 5.0 NaN 5.0
3 NaN NaN NaN
Edited Sample
Showing that it would work if df['A'] is nan and df['B'] has value.
>>> df = pd.DataFrame({'A':[3, np.nan, 5, np.nan],'B':[np.nan,4,np.nan,np.nan]})
>>> df
A B
0 3.0 NaN
1 NaN 4.0
2 5.0 NaN
3 NaN NaN
>>> df['C'] = np.where(df['A'].isna(), df['B'], df['A'])
>>> df
A B C
0 3.0 NaN 3.0
1 NaN 4.0 4.0
2 5.0 NaN 5.0
3 NaN NaN NaN
Use np.select where you can check multiple conditions.
df = pd.DataFrame({'A':[3, 4, 5, np.nan, 6, np.nan],
'B':[np.nan, 4, np.nan, np.nan, 6, 7]})
df['c'] = np.select([df['A'].isnull() & df['B'].isnull(), df['A'].isnull()],
[np.nan, df['B']], df['A'])
Output:
A B c
0 3.0 NaN 3.0
1 4.0 4.0 4.0
2 5.0 NaN 5.0
3 NaN NaN NaN
4 6.0 6.0 6.0
5 NaN 7.0 7.0
If it's guaranteed that A & B are identical values when not nans, then it looks like you could use .combine_first here:
df['C'] = df.A.combine_first(df.B)
I think fillna is sufficient for your requirement:
df['C'] = df.A.fillna(df.B)
Out[92]:
A B C
0 3.0 NaN 3.0
1 4.0 4.0 4.0
2 5.0 NaN 5.0
3 NaN NaN NaN
4 6.0 6.0 6.0
5 7.0 7.0 7.0
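The combine_first and fillna answers rely on A and B agreeing wherever both are present. If that guarantee is in doubt, it can be checked before coalescing. The following is a minimal sketch using the question's data; the names both_present and conflicts are only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [3, 4, 5, np.nan, 6, 7],
                   "B": [np.nan, 4, np.nan, np.nan, 6, 7]})

# Rows where both columns hold a value but the values disagree.
both_present = df["A"].notna() & df["B"].notna()
conflicts = both_present & (df["A"] != df["B"])
if conflicts.any():
    raise ValueError(f"A and B disagree in rows: {list(df.index[conflicts])}")

# Safe to coalesce: take A, falling back to B where A is null.
df["C"] = df["A"].fillna(df["B"])
print(df)
```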

Replacing empty values in a DataFrame with value of a column

Say I have the following pandas dataframe:
df = pd.DataFrame([[3, 2, np.nan, 0],
[5, 4, 2, np.nan],
[7, np.nan, np.nan, 5],
[9, 3, np.nan, 4]],
columns=list('ABCD'))
which returns this:
A B C D
0 3 2.0 NaN 0.0
1 5 4.0 2.0 NaN
2 7 NaN NaN 5.0
3 9 3.0 NaN 4.0
I'd like that if a np.nan is found, that the value is replaced by a value in the A column. So that would mean the result to be this:
A B C D
0 3 2.0 3.0 0.0
1 5 4.0 2.0 5.0
2 7 7.0 7.0 5.0
3 9 3.0 9.0 4.0
I've tried multiple things, but I could not get anything to work. Can anyone help?
A double transpose is necessary here:
cols = ['B','C', 'D']
df[cols] = df[cols].T.fillna(df['A']).T
print(df)
A B C D
0 3 2.0 3.0 0.0
1 5 4.0 2.0 5.0
2 7 7.0 7.0 5.0
3 9 3.0 9.0 4.0
because:
df[cols] = df[cols].fillna(df['A'], axis=1)
print(df)
NotImplementedError: Currently only can fill with dict/Series column by column
Another solution with numpy.where and broadcasting column A:
df = pd.DataFrame(np.where(df.isnull(), df['A'].values[:, None], df),
index=df.index,
columns=df.columns)
print (df)
A B C D
0 3.0 2.0 3.0 0.0
1 5.0 4.0 2.0 5.0
2 7.0 7.0 7.0 5.0
3 9.0 3.0 9.0 4.0
Thank you @pir for another solution:
df = pd.DataFrame(np.where(df.isnull(), df[['A']], df),
index=df.index,
columns=df.columns)
Currently, fillna doesn't allow for broadcasting a series across columns while aligning the indices.
pandas.DataFrame.mask
This works exactly like what we'd want fillna to do: it finds the nulls and fills them in with df.A along axis=0.
df.mask(df.isna(), df.A, axis=0)
A B C D
0 3 2.0 3.0 0.0
1 5 4.0 2.0 5.0
2 7 7.0 7.0 5.0
3 9 3.0 9.0 4.0
pandas.DataFrame.fillna using a dictionary
However, you can pass a dictionary to fillna that tells it what to do for each column.
df.fillna({k: df.A for k in df})
A B C D
0 3 2.0 3.0 0.0
1 5 4.0 2.0 5.0
2 7 7.0 7.0 5.0
3 9 3.0 9.0 4.0
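Do fillna with reindex (axis is passed by keyword, since recent pandas versions no longer accept it positionally in ffill):
df.fillna(df[['A']].reindex(columns=df.columns).ffill(axis=1))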
Out[20]:
A B C D
0 3 2.0 3.0 0.0
1 5 4.0 2.0 5.0
2 7 7.0 7.0 5.0
3 9 3.0 9.0 4.0
Or combine_first
df.combine_first(df.fillna(0).add(df.A,0))
Out[35]:
A B C D
0 3 2.0 3.0 0.0
1 5 4.0 2.0 5.0
2 7 7.0 7.0 5.0
3 9 3.0 9.0 4.0
# for each column...
for col in df.columns:
    # select the NaNs and replace them with the value of A
    df.loc[df[col].isnull(), col] = df["A"]
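The same column-by-column fill can also be written without mutating df, by applying Series.fillna to each column. This is a sketch of an alternative, not one of the original answers; the name filled is only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[3, 2, np.nan, 0],
                   [5, 4, 2, np.nan],
                   [7, np.nan, np.nan, 5],
                   [9, 3, np.nan, 4]],
                  columns=list('ABCD'))

# Each column is filled from A; fillna aligns on the row index.
filled = df.apply(lambda col: col.fillna(df["A"]))
print(filled)
```

This sidesteps the NotImplementedError raised by fillna with axis=1, because each call fills a single column with a Series that shares its index.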

Coalesce values from 2 columns into a single column in a pandas dataframe

