Given the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'COL1': ['A', np.nan, 'A'],
                   'COL2': [np.nan, 'A', 'A']})
df
COL1 COL2
0 A NaN
1 NaN A
2 A A
I would like to create a column ('COL3') that uses the value from COL1 per row unless that value is null (or NaN). If the value is null (or NaN), I'd like for it to use the value from COL2.
The desired result is:
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A
Thanks in advance!
fillna does exactly this: each NaN in COL1 is filled with the COL2 value at the same index (this demo uses B in COL2 so you can see where each value came from):
In [8]: df
Out[8]:
COL1 COL2
0 A NaN
1 NaN B
2 A B
In [9]: df["COL3"] = df["COL1"].fillna(df["COL2"])
In [10]: df
Out[10]:
COL1 COL2 COL3
0 A NaN A
1 NaN B B
2 A B A
You can use np.where to conditionally set column values.
df = df.assign(COL3=np.where(df.COL1.isnull(), df.COL2, df.COL1))
>>> df
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A
If you don't mind mutating the values in COL2, you can update them directly to get your desired result.
df = pd.DataFrame({'COL1': ['A', np.nan, 'A'],
                   'COL2': [np.nan, 'B', 'B']})
>>> df
COL1 COL2
0 A NaN
1 NaN B
2 A B
df.COL2.update(df.COL1)
>>> df
COL1 COL2
0 A A
1 NaN B
2 A A
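If you'd rather leave COL2 untouched, a minimal sketch of the same idea is to update a copy instead (assumes the df just above):
s = df['COL2'].copy()   # start from COL2
s.update(df['COL1'])    # overwrite with the non-null values of COL1
df['COL3'] = s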
Using .combine_first, which gives precedence to non-null values in the Series or DataFrame calling it:
import pandas as pd
import numpy as np
df = pd.DataFrame({'COL1': ['A', np.nan, 'A'],
                   'COL2': [np.nan, 'B', 'B']})
df['COL3'] = df.COL1.combine_first(df.COL2)
Output:
COL1 COL2 COL3
0 A NaN A
1 NaN B B
2 A B A
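combine_first also works frame-wise; a minimal sketch, where other is a hypothetical fallback frame (not from the question, just for illustration):
other = pd.DataFrame({'COL1': ['X', 'Y', 'Z'],
                      'COL2': ['X', 'Y', 'Z']})
patched = df[['COL1', 'COL2']].combine_first(other)   # NaNs in df are filled from other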
If we modify your df slightly, you can see that this works, and in fact it will work for any number of columns so long as each row has at least one valid value:
In [5]:
df = pd.DataFrame({'COL1': ['B', np.nan, 'B'],
                   'COL2': [np.nan, 'A', 'A']})
df
Out[5]:
COL1 COL2
0 B NaN
1 NaN A
2 B A
In [6]:
df.apply(lambda x: x[x.first_valid_index()], axis=1)
Out[6]:
0 B
1 A
2 B
dtype: object
first_valid_index will return the index value (in this case column) that contains the first non-NaN value:
In [7]:
df.apply(lambda x: x.first_valid_index(), axis=1)
Out[7]:
0 COL1
1 COL2
2 COL1
dtype: object
So we can use this to index into the row Series.
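To attach the result as a new column, a sketch (note that first_valid_index returns None for an all-NaN row, so this would raise in that case):
df['COL3'] = df.apply(lambda x: x[x.first_valid_index()], axis=1)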
You can also use mask, which replaces the values where COL1 is NaN with the values from COL2:
In [8]: df.assign(COL3=df['COL1'].mask(df['COL1'].isna(), df['COL2']))
Out[8]:
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A
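Series.where is the inverse of mask (it keeps values where the condition holds), so an equivalent sketch is:
df['COL3'] = df['COL1'].where(df['COL1'].notna(), df['COL2'])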
Related
I have two dataframes
df1 = pd.DataFrame({'col1': [1,2,3], 'col2': [4,5,6]})
df2 = pd.DataFrame({'col3': [1,5,3]})
and would like to left merge df1 to df2. I don't have a fixed merge column in df1, though. I would like to merge on col1 if the cell value of col1 exists in df2.col3, and on col2 if the cell value of col2 exists in df2.col3. So in the above example, merge on col1, col2 and then col1. (This is just an example; I actually have more than just two columns.)
I could do this but I'm not sure if it's ok.
df1 = df1.assign(merge_col = np.where(df1.col1.isin(df2.col3), df1.col1, df1.col2))
df1.merge(df2, left_on='merge_col', right_on='col3', how='left')
Are there any better ways to solve it?
Perform the merges in the preferred order, and use combine_first to combine the merges:
(df1.merge(df2, left_on='col1', right_on='col3', how='left')
    .combine_first(df1.merge(df2, left_on='col2', right_on='col3', how='left'))
)
For a generic method with many columns:
from functools import reduce

cols = ['col1', 'col2']
out = reduce(
    lambda a, b: a.combine_first(b),
    [df1.merge(df2, left_on=col, right_on='col3', how='left')
     for col in cols]
)
Output:
col1 col2 col3
0 1 4 1.0
1 2 5 5.0
2 3 6 3.0
Better example:
Adding another column to df2 to illustrate the merge:
df2 = pd.DataFrame({'col3': [1,5,3], 'new': ['A', 'B', 'C']})
Output:
col1 col2 col3 new
0 1 4 1.0 A
1 2 5 5.0 B
2 3 6 3.0 C
I think your solution can be modified: build the merge key as a Series by comparing all the columns from the list, then merge on that Series:
cols = ['col1', 'col2']
s = df1[cols].where(df1[cols].isin(df2.col3)).bfill(axis=1).iloc[:, 0]
print (s)
0 1.0
1 5.0
2 3.0
Name: col1, dtype: float64
df = df1.merge(df2, left_on=s, right_on='col3', how='left')
print (df)
col1 col2 col3
0 1 4 1
1 2 5 5
2 3 6 3
Your solution with helper column:
cols = ['col1', 'col2']
df1 = df1.assign(merge_col=df1[cols].where(df1[cols].isin(df2.col3))
                                    .bfill(axis=1).iloc[:, 0])
df = df1.merge(df2, left_on='merge_col', right_on='col3', how='left')
print (df)
col1 col2 merge_col col3
0 1 4 1.0 1
1 2 5 5.0 5
2 3 6 3.0 3
Explanation of s: compare all the columns with DataFrame.isin, create missing values where there is no match with DataFrame.where, and, to respect the merge priority, back fill the missing values and select the first column by position:
print (df1[cols].isin(df2.col3))
col1 col2
0 True False
1 False True
2 True False
print (df1[cols].where(df1[cols].isin(df2.col3)))
col1 col2
0 1.0 NaN
1 NaN 5.0
2 3.0 NaN
print (df1[cols].where(df1[cols].isin(df2.col3)).bfill(axis=1))
col1 col2
0 1.0 NaN
1 5.0 5.0
2 3.0 NaN
print (df1[cols].where(df1[cols].isin(df2.col3)).bfill(axis=1).iloc[:, 0])
0 1.0
1 5.0
2 3.0
Name: col1, dtype: float64
I am attempting to merge two columns which contain strings and nans.
When I attempt to merge them I cannot deal with the nan values.
df['col3'] = df['col1'] + df['col2']
returns only my col2 values
df['col3'] = df['col1'].map(str) + df['col2'].map(str)
returns my nans attached to each other.
If I don't use .map(str), then the NaN values don't concatenate at all.
Is there a way to concatenate two dataframe columns so that if either of them is NaN they aren't concatenated, unless both are NaN, in which case I do want NaN returned?
Example:
df
col0 col1 col2 col3
X A nan A
Y nan B B
Z nan nan nan
W '' B B
You could index into the last two columns first and forward fill across them, then take col2:
df['col3'] = df[['col1', 'col2']].ffill(axis=1).col2
col0 col1 col2 col3
0 X A NaN A
1 Y NaN B B
2 Z NaN NaN NaN
3 W '' B B
This is fillna:
df['col3'] = df.col2.fillna(df.col1)
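A quick check on the example frame (a sketch; col0 is omitted, and the empty string in col1 stays as-is since fillna only fills actual NaN):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['A', np.nan, np.nan, ''],
                   'col2': [np.nan, 'B', np.nan, 'B']})
df['col3'] = df.col2.fillna(df.col1)   # col2 where present, otherwise col1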
Apply np.where, and in the case where both values exist, combine them:
import pandas as pd
import numpy as np

df = pd.DataFrame(data={"col1": ["A", np.nan, "B", np.nan, "C"],
                        "col2": [np.nan, "B", np.nan, np.nan, "d"]})
df['col3'] = np.where(df['col1'].isnull(), df['col2'],
                      np.where(df['col2'].isnull(), df['col1'],
                               df['col1'] + df['col2']))
col1 col2 col3
0 A NaN A
1 NaN B B
2 B NaN B
3 NaN NaN NaN
4 C d Cd
fillna() and replace() are what you're looking for; here's a fully working example:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'col1': ["A", "B", "C", np.nan],
    'col2': ["D", "E", np.nan, np.nan]
})
df['col3'] = df['col1'].fillna('') + df['col2'].fillna('')
df['col3'] = df['col3'].replace('', np.nan)
print(df)
It first replaces the NaN values with empty strings; then, where both were empty, it replaces the result back with NaN.
Output:
col1 col2 col3
0 A D AD
1 B E BE
2 C NaN C
3 NaN NaN NaN
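An alternative sketch with the same semantics uses Series.str.cat; na_rep='' treats NaN as an empty string, and empty results are mapped back to NaN:
df['col3'] = df['col1'].str.cat(df['col2'], na_rep='').replace('', np.nan)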
I have a data frame like this:
df
col1 col2 col3 col4
A 12 34 XX
B 20 25 PP
B nan nan nan
nan P 54 nan
nan R nan nan
nan nan nan PQ
C D 32 SS
R S 32 RS
If a col1 value is null, I want to fold that row's values into the previous row, continuing until a non-null element is found in col1.
The data frame i am looking for should look like:
col1 col2 col3 col4
A 12 34 XX
B 20 25 PP
B PR 54 PQ
C D 32 SS
R S 32 RS
How can I do this in the most efficient way using python/pandas?
If you want to process all columns as strings: first forward fill the missing values in col1, replace the NaNs with empty strings, convert all values to strings, and concatenate within each group with sum:
df['col1'] = df['col1'].ffill()
df = df.set_index('col1').fillna('').astype(str).groupby(level=0).sum().reset_index()
print (df)
col1 col2 col3 col4
0 A 12 34.0 XX
1 B PR 54.0 PQ
2 C D 32.0 SS
print (df.dtypes)
col1 object
col2 object
col3 object
col4 object
dtype: object
If you need to process the numeric columns with an aggregation, e.g. mean, use a lambda function with if-else:
df['col1'] = df['col1'].ffill()
c = df.select_dtypes(object).columns
df[c] = df[c].fillna('')
f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else ''.join(x)
df = df.groupby('col1').agg(f).reset_index()
print (df)
col1 col2 col3 col4
0 A 12 34.0 XX
1 B PR 54.0 PQ
2 C D 32.0 SS
print (df.dtypes)
col1 object
col2 object
col3 float64
col4 object
dtype: object
EDIT: A helper column is used so that repeated col1 values (such as the two B groups here) stay separate:
df['new'] = df['col1'].notna().cumsum()
df['col1'] = df['col1'].ffill()
c = df.select_dtypes(object).columns
df[c] = df[c].fillna('')
f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else ''.join(x)
df = df.groupby(['col1', 'new']).agg(f).reset_index(level=1, drop=True).reset_index()
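The helper column increments at every non-null col1 entry (note it is computed before the ffill), which is what keeps the two B runs apart:
print(df['new'])   # 1, 2, 3, 3, 3, 3, 4, 5 for the sample rows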
I have a python dictionary
d = {1: 'cat',
     2: 'dog',
     3: 'sheep',
     4: 'foo',
     5: 'bar',
     6: 'fish',
     7: 'lion',
     8: 'shark',
     9: 'zebra',
     10: 'snake'}
Also I have pandas dataframe as following
df:
ID col1 col2 col2 col4
18938 1 Nan 5 Nan
17839 Nan 2 Nan 8
72902 3 5 9 Nan
78298 7 Nan Nan 6
Now I am trying to replace or map the values of each cell in each column to the dictionary values, and then concatenate all the column values into a new column.
The new df should look like:
ID col1 col2 col2 col4 new_col
18938 cat Nan bar Nan cat|bar
17839 Nan dog Nan shark dog|shark
72902 sheep bar zebra Nan sheep|bar|zebra
78298 lion Nan Nan fish lion|fish
I can do the 2nd step, concatenating all the columns, with
df['new_col'] = df.drop('ID', axis=1).agg(lambda x: '|'.join(x.dropna().astype(str).values), axis=1)
but I am unable to get the first step working
I used
df = df.columns.map(d)
but it is not giving me the expected answer.
You could try this:
import numpy as np

df = df.set_index('ID')
d1 = pd.concat([df[i].replace('Nan', np.nan).dropna().astype(int).map(d)
                for i in df.columns], axis=1)
d1['new_col'] = d1.apply(lambda x: '|'.join(x.dropna()), axis=1)
print(d1)
Or if you want a little slower but more concise code:
d1 = df.apply(lambda x: x.replace('Nan', np.nan).dropna().astype(int).map(d))
d1['new_col'] = d1.apply(lambda x: '|'.join(x.dropna()), axis=1)
d1
Output:
col1 col2 col2.1 col4 new_col
ID
17839 NaN dog NaN shark dog|shark
18938 cat NaN bar NaN cat|bar
72902 sheep bar zebra NaN sheep|bar|zebra
78298 lion NaN NaN fish lion|fish
Use df.replace() with your dictionary (here d):
df = df.replace(d)
Note that if the keys in your dictionary are strings you may need regex=True:
df = df.replace(d, regex=True)
Example:
import pandas as pd
d = {1:"cat",
2:"dog",
3:"sheep",
4:"foo",
5:"bar",
6:"fish",
7:"lion",
8:"shark",
9:"zebra",
10:"snake"}
df = pd.DataFrame({'ID': [123, 456], 'col1': [1, 2], 'col2': [5, 6]})
df = df.replace(d)
print(df)
Output:
ID col1 col2
0 123 cat bar
1 456 dog fish
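Putting both steps together on the question's frame, a sketch (it assumes the numeric cells are ints and the missing cells are the literal string 'Nan', as posted):
import numpy as np

df['new_col'] = (df.drop('ID', axis=1)
                   .replace('Nan', np.nan)       # normalize the 'Nan' strings
                   .apply(lambda c: c.map(d))    # map ids to animal names
                   .agg(lambda x: '|'.join(x.dropna()), axis=1))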
Pardon my poorly phrased question--I'm not sure how to word it.
Given this pandas pivot table,
df = pd.DataFrame({'col1': list('AABB'),
                   'col2': list('acab'),
                   'values': [1, 3, 4, 5]})
pt = pd.pivot_table(df,
                    index=['col1', 'col2'],
                    values='values',
                    aggfunc=sum)
Output:
           values
col1 col2
A    a          1
     c          3
B    a          4
     b          5
How can I make the pivot table output this instead:
           values
col1 col2
A    a          1
     b        NaN
     c          3
B    a          4
     b          5
     c        NaN
If you convert your column to the category data type (new in pandas 0.15!) you will get the aggregation that you are after:
df.col2 = df.col2.astype('category')
In [378]: df.groupby(['col1','col2']).sum()
Out[378]:
values
col1 col2
A a 1
b NaN
c 3
B a 4
b 5
c NaN
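On recent pandas versions, a sketch of the same idea passes observed=False explicitly (the default for categorical groupers has been changing) and min_count=1 so that empty groups stay NaN instead of 0:
df.col2 = df.col2.astype('category')
out = df.groupby(['col1', 'col2'], observed=False)['values'].sum(min_count=1)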