merge in pandas and output only selected columns - python

Is there a way to do a merge in pandas limiting the columns you want to see?
What I have:
df1
ID Col1 Col2 Col3 Col4
1 1 1 1 D
2 A C C 4
3 B B B d
4 X 2 3 6
df2
ID ColA ColB ColC ColD
1 1 1 1 D
2 A C X 4
3 B B Y d
What I want:
df_final
ID ColA ColB ColC ColD
1 NA NA NA NA
2 A C X 4
3 B B Y d
4 NA NA NA NA
I want to do a left join on two dataframes (keeping all IDs from df1) but I only want to keep the columns from df2. I also only want values if Col3 from df1 is either C or B.
The following works, but the resulting df includes all columns from both dfs. I could add a third line to select only the columns I want, but this is a simple example; in reality I have much larger datasets and it's difficult to manually type out all the column names I want to keep.
df = pd.merge(df1, df2, how='left', on='ID')
df_final = df[df['Col3'].isin(['C', 'B'])]
Equivalent SQL would be
create table df_final as
select b.*
from df1 a
left join df2 b
on a.ID=b.ID
where a.Col3 in ('C','B')

Mask df1 with your isin condition before the merge:
df1.where(df1.Col3.isin(['C', 'B']))[['ID']].merge(df2, how='left', on='ID')
Or,
df1.mask(~df1.Col3.isin(['C', 'B']))[['ID']].merge(df2, how='left', on='ID')
ID ColA ColB ColC ColD
0 NaN NaN NaN NaN NaN
1 2 A C X 4
2 3 B B Y d
3 NaN NaN NaN NaN NaN

This should do the trick
df=pd.merge(df1[df1.Col3.isin(['C','B'])][['ID']], df2, how='left', on='ID')
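Putting the pieces together, here is a runnable sketch (recreating the sample frames above) that keeps every ID from df1 but only df2's columns, matching the requested df_final:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'Col1': ['1', 'A', 'B', 'X'],
                    'Col2': ['1', 'C', 'B', '2'],
                    'Col3': ['1', 'C', 'B', '3'],
                    'Col4': ['D', '4', 'd', '6']})
df2 = pd.DataFrame({'ID': [1, 2, 3],
                    'ColA': ['1', 'A', 'B'],
                    'ColB': ['1', 'C', 'B'],
                    'ColC': ['1', 'X', 'Y'],
                    'ColD': ['D', '4', 'd']})

# IDs from df1 whose Col3 passes the filter
mask = df1['Col3'].isin(['C', 'B'])
hits = df1.loc[mask, ['ID']].merge(df2, on='ID', how='left')

# Left-join back onto the full ID list so filtered-out IDs
# survive as all-NaN rows, as in the desired output
df_final = df1[['ID']].merge(hits, on='ID', how='left')
```

This keeps the ID column populated for every row; if you prefer the ID itself to be NaN on filtered-out rows (as in the output shown above), the where/mask answers do exactly that.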

Related

Imputing values into a dataframe based on another dataframe and a condition

Suppose I have the following dataframes:
df1 = pd.DataFrame({'col1':['a','b','c','d'],'col2':[1,2,3,4]})
df2 = pd.DataFrame({'col3':['a','x','a','c','b']})
I wonder how I can look up values in df1, make a new column on df2, and fill it with the matching values from col2; where there is no match I want to impute 0. The result should look like the following:
col3 col4
0 a 1
1 x 0
2 a 1
3 c 3
4 b 2
Use Series.map with Series.fillna:
df2['col2'] = df2['col3'].map(df1.set_index('col1')['col2']).fillna(0).astype(int)
print (df2)
col3 col2
0 a 1
1 x 0
2 a 1
3 c 3
4 b 2
Or use DataFrame.merge, which is better if you need to append multiple columns:
df = df2.merge(df1.rename(columns={'col1':'col3'}), how='left').fillna(0)
print (df)
col3 col2
0 a 1.0
1 x 0.0
2 a 1.0
3 c 3.0
4 b 2.0
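If the floats produced by the merge route are a concern, a small runnable sketch (same sample data as above) restores the integer dtype after imputing:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'], 'col2': [1, 2, 3, 4]})
df2 = pd.DataFrame({'col3': ['a', 'x', 'a', 'c', 'b']})

# Merge on the lookup key, impute 0 for missing matches,
# then cast back to int (fillna leaves the column as float)
out = (df2.merge(df1.rename(columns={'col1': 'col3'}), on='col3', how='left')
          .fillna({'col2': 0})
          .astype({'col2': int}))
```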

Concatenate only new values of first column of one dataframe to an other one

I can't find a proper way to concat only new values of colA. This is quite simple: I need the new elements of column A from DF2 to be added to DF1.
DF1
colA colB colC
a 5 7
b 4 5
c 5 6
DF2
colA colE colF
a 7 e
b d 4
c f g
d h h
e 4 r
I have tried simple code like this, but the output dataframe is not correct:
DF3 = pd.concat([DF1, DF2['colA']])
DF3.drop_duplicates(subset=['colA'], inplace=True, keep='last')
The result is that [a, 5, 7] is dropped and replaced by [a, nan, nan].
What I need is this :
DF3 merged colA
colA colB colC
a 5 7
b 4 5
c 5 6
d
e
Then I fill DF3's missing values manually. I don't need colE or colF in DF3.
You can use pandas.DataFrame.merge:
>>> DF1.merge(DF2, how='outer', on='colA').reindex(DF1.columns, axis=1)
colA colB colC
0 a 5.0 7.0
1 b 4.0 5.0
2 c 5.0 6.0
3 d NaN NaN
4 e NaN NaN
Edit
To remove the NaNs and convert the other values back to int, you can try:
>>> DF1.merge(DF2['colA'], how='outer').fillna(-1, downcast='infer').replace({-1: ''})
colA colB colC
0 a 5 7
1 b 4 5
2 c 5 6
3 d
4 e
# if the -1 sentinel is a concern, convert to nullable "Int64" instead
>>> DF1.astype({'colB': 'Int64', 'colC': 'Int64'}).merge(DF2['colA'], how='outer')
colA colB colC
0 a 5 7
1 b 4 5
2 c 5 6
3 d <NA> <NA>
4 e <NA> <NA>
# You can replace the NaNs with a string as well:
>>> DF1.astype({
'colB': 'Int64',
'colC': 'Int64'
}).merge(DF2['colA'], how='outer').replace({np.nan: ''})
colA colB colC
0 a 5 7
1 b 4 5
2 c 5 6
3 d
4 e
Remove keep='last' to get the default keep='first':
DF3.drop_duplicates(subset=['colA'], inplace=True, keep='last')
to:
DF3.drop_duplicates(subset=['colA'], inplace=True)
Or just outer merge with DF2[['colA']]:
DF1.merge(DF2[['colA']], how='outer')
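As a runnable sketch of the nullable-integer route (recreating the sample frames from the question):

```python
import pandas as pd

DF1 = pd.DataFrame({'colA': ['a', 'b', 'c'],
                    'colB': [5, 4, 5],
                    'colC': [7, 5, 6]})
DF2 = pd.DataFrame({'colA': ['a', 'b', 'c', 'd', 'e'],
                    'colE': ['7', 'd', 'f', 'h', '4'],
                    'colF': ['e', '4', 'g', 'h', 'r']})

# Outer-merge on colA only; the nullable Int64 dtype keeps colB/colC
# integer-typed instead of upcasting to float when NaNs appear
DF3 = (DF1.astype({'colB': 'Int64', 'colC': 'Int64'})
          .merge(DF2[['colA']], on='colA', how='outer'))
```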

How to compare values of certain columns of one dataframe with the values of same set of columns in another dataframe?

I have three dataframes df1, df2, and df3, which are defined as follows
df1 =
A B C
0 1 a a1
1 2 b b2
2 3 c c3
3 4 d d4
4 5 e e5
5 6 f f6
df2 =
A B C
0 1 a X
1 2 b Y
2 3 c Z
df3 =
A B C
3 4 d P
4 5 e Q
5 6 f R
I have defined a Primary Key list PK = ["A","B"].
Now, I take a fourth dataframe df4 as df4 = df1.sample(n=2), which gives something like
df4 =
A B C
4 5 e e5
1 2 b b2
Now, I want to select the rows from df2 and df3 that match the values of the primary keys of df4.
For example, in this case I need to get the row with
index = 4 from df3,
index = 1 from df2.
If possible I need to get a dataframe as follows:
df =
A B C A(df2) B(df2) C(df2) A(df3) B(df3) C(df3)
4 5 e e5 5 e Q
1 2 b b2 2 b Y
Any ideas on how to work this out will be very helpful.
Use two consecutive DataFrame.merge operations, applying DataFrame.add_suffix to the right dataframe, to left-merge df4 with df2 and then df3; finally use DataFrame.fillna to replace the missing values with an empty string:
df = (
df4.merge(df2.add_suffix('(df2)'), left_on=['A', 'B'], right_on=['A(df2)', 'B(df2)'], how='left')
.merge(df3.add_suffix('(df3)'), left_on=['A', 'B'], right_on=['A(df3)', 'B(df3)'], how='left')
.fillna('')
)
Result:
# print(df)
A B C A(df2) B(df2) C(df2) A(df3) B(df3) C(df3)
0 5 e e5 5 e Q
1 2 b b2 2 b Y
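A self-contained version of the double merge, reconstructing the sample frames (with df4 pinned to the two sampled rows above rather than drawn with df1.sample):

```python
import pandas as pd

df2 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c'], 'C': ['X', 'Y', 'Z']})
df3 = pd.DataFrame({'A': [4, 5, 6], 'B': ['d', 'e', 'f'], 'C': ['P', 'Q', 'R']})
df4 = pd.DataFrame({'A': [5, 2], 'B': ['e', 'b'], 'C': ['e5', 'b2']})

# Suffix the right frame's columns so both copies of the key survive,
# then chain two left merges on the primary key ['A', 'B']
df = (df4.merge(df2.add_suffix('(df2)'),
                left_on=['A', 'B'],
                right_on=['A(df2)', 'B(df2)'], how='left')
         .merge(df3.add_suffix('(df3)'),
                left_on=['A', 'B'],
                right_on=['A(df3)', 'B(df3)'], how='left')
         .fillna(''))
```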
Here's how I would do it on the entire data set. If you want to sample first, just update the merge statements at the end by replacing df1 with df4, or take a sample of t.
PK = ["A","B"]
df2 = pd.concat([df2,df2], axis=1)
df2.columns=['A','B','C','A(df2)', 'B(df2)', 'C(df2)']
df2.drop(columns=['C'], inplace=True)
df3 = pd.concat([df3,df3], axis=1)
df3.columns=['A','B','C','A(df3)', 'B(df3)', 'C(df3)']
df3.drop(columns=['C'], inplace=True)
t = df1.merge(df2, on=PK, how='left')
t = t.merge(df3, on=PK, how='left')
Output
A B C A(df2) B(df2) C(df2) A(df3) B(df3) C(df3)
0 1 a a1 1.0 a X NaN NaN NaN
1 2 b b2 2.0 b Y NaN NaN NaN
2 3 c c3 3.0 c Z NaN NaN NaN
3 4 d d4 NaN NaN NaN 4.0 d P
4 5 e e5 NaN NaN NaN 5.0 e Q
5 6 f f6 NaN NaN NaN 6.0 f R

Merge small file into big file and give NaN's for the rows that do not match in python

I would like to merge two data frames - big one and small one. Example of data frames is following:
# small data frame construction
>>> d1 = {'col1': ['A', 'B'], 'col2': [3, 4]}
>>> df1 = pd.DataFrame(data=d1)
>>> df1
col1 col2
0 A 3
1 B 4
# big data frame construction
>>> d2 = {'col1': ['A', 'B', 'C', 'D', 'E'], 'col2': [3, 4, 6, 7, 8]}
>>> df2 = pd.DataFrame(data=d2)
>>> df2
col1 col2
0 A 3
1 B 4
2 C 6
3 D 7
4 E 8
The code I am looking for should produce the following output (a data frame with big data frame shape, column names, and NaNs in rows that were not merged with the small data frame):
col1 col2
0 A 3
1 B 4
2 NA NA
3 NA NA
4 NA NA
The code I have tried:
>>> print(pd.merge(df1, df2, left_index=True, right_index=True, how='right', sort=False))
col1_x col2_x col1_y col2_y
0 A 3.0 A 3
1 B 4.0 B 4
2 NaN NaN C 6
3 NaN NaN D 7
4 NaN NaN E 8
You can pass the suffixes parameter so the overlapping columns from the right frame get a _ appended, then drop those columns with Series.str.endswith, an inverted mask (~), and boolean indexing via loc:
df = pd.merge(df1, df2,
left_index=True,
right_index=True,
how='right',
sort=False,
suffixes=('','_'))
print (df)
col1 col2 col1_ col2_
0 A 3.0 A 3
1 B 4.0 B 4
2 NaN NaN C 6
3 NaN NaN D 7
4 NaN NaN E 8
df = df.loc[:, ~df.columns.str.endswith('_')]
print (df)
col1 col2
0 A 3.0
1 B 4.0
2 NaN NaN
3 NaN NaN
4 NaN NaN
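Since the frames here align purely on the integer index, an alternative sketch with DataFrame.reindex avoids the merge-and-drop dance entirely (assuming the default RangeIndex on both frames):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['A', 'B'], 'col2': [3, 4]})
df2 = pd.DataFrame({'col1': ['A', 'B', 'C', 'D', 'E'],
                    'col2': [3, 4, 6, 7, 8]})

# Reindex the small frame onto the big frame's index;
# rows missing from df1 become all-NaN
out = df1.reindex(df2.index)
```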

Set dataframe column using values from matching indices in another dataframe

I would like to set values in col2 of DF1 using the value held at the matching index of col2 in DF2:
DF1:
col1 col2
index
0 a
1 b
2 c
3 d
4 e
5 f
DF2:
col1 col2
index
2 a x
3 d y
5 f z
DF3:
col1 col2
index
0 a NaN
1 b NaN
2 c x
3 d y
4 e NaN
5 f z
If I just try and set DF1['col2'] = DF2['col2'] then col2 comes out as all NaN values in DF3 - I take it this is because the indices are different. However when I try and use map() to do something like:
DF1.index.to_series().map(DF2['col2'])
then I still get the same NaN column, but I thought it would map the values over where the index matches...
What am I not getting?
You need join or assign:
df = df1.join(df2['col2'])
print (df)
col1 col2
index
0 a NaN
1 b NaN
2 c x
3 d y
4 e NaN
5 f z
Or:
df1 = df1.assign(col2=df2['col2'])
# same as
# df1['col2'] = df2['col2']
print (df1)
col1 col2
index
0 a NaN
1 b NaN
2 c x
3 d y
4 e NaN
5 f z
If nothing matches and all values are NaN, check whether the indices have the same dtype in both dataframes:
print (df1.index.dtype)
print (df2.index.dtype)
If not, then use astype:
df1.index = df1.index.astype(int)
df2.index = df2.index.astype(int)
A bad solution (check index 2: col1 becomes a instead of c):
df = df2.combine_first(df1)
print (df)
col1 col2
index
0 a NaN
1 b NaN
2 a x
3 d y
4 e NaN
5 f z
You can simply use concat, since you are combining based on the index:
df = pd.concat([df1['col1'], df2['col2']],axis = 1)
col1 col2
index
0 a NaN
1 b NaN
2 c x
3 d y
4 e NaN
5 f z
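A runnable sketch of the join approach, recreating the sample frames with their named integer index:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': list('abcdef')},
                   index=pd.RangeIndex(6, name='index'))
df2 = pd.DataFrame({'col1': ['a', 'd', 'f'],
                    'col2': ['x', 'y', 'z']},
                   index=pd.Index([2, 3, 5], name='index'))

# join aligns on the index; rows of df1 with no match in df2 get NaN
df3 = df1.join(df2['col2'])
```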
