merge in pandas and output only selected columns - python

Is there a way to do a merge in pandas limiting the columns you want to see?
What I have:
df1
ID Col1 Col2 Col3 Col4
1 1 1 1 D
2 A C C 4
3 B B B d
4 X 2 3 6
df2
ID ColA ColB ColC ColD
1 1 1 1 D
2 A C X 4
3 B B Y d
What I want:
df_final
ID ColA ColB ColC ColD
1 NA NA NA NA
2 A C X 4
3 B B Y d
4 NA NA NA NA
I want to do a left join on two dataframes (keeping all IDs from df1) but I only want to keep the columns from df2. I also only want values if Col3 from df1 is either C or B.
The following works, but the resulting df includes all columns from both dfs. I could add a third line to select only the columns I want, but this is a simple example; in reality I have much larger datasets and it's difficult to manually type out all the column names I want to keep.
df = pd.merge(df1, df2, how='left', on='ID')
df_final = df[df['Col3'].isin(['C', 'B'])]
Equivalent SQL would be
create table df_final as
select b.*
from df1 a
left join df2 b
on a.ID=b.ID
where a.Col3 in ('C','B')

Mask df1 with your isin condition before the merge:
df1.where(df1.Col3.isin(['C', 'B']))[['ID']].merge(df2, how='left', on='ID')
Or,
df1.mask(~df1.Col3.isin(['C', 'B']))[['ID']].merge(df2, how='left', on='ID')
ID ColA ColB ColC ColD
0 NaN NaN NaN NaN NaN
1 2 A C X 4
2 3 B B Y d
3 NaN NaN NaN NaN NaN

This should do the trick
df=pd.merge(df1[df1.Col3.isin(['C','B'])][['ID']], df2, how='left', on='ID')
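Putting the pieces together, here is a runnable sketch (recreating the sample frames above) that keeps every ID from df1 but only df2's columns, matching the requested df_final:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'Col1': ['1', 'A', 'B', 'X'],
                    'Col2': ['1', 'C', 'B', '2'],
                    'Col3': ['1', 'C', 'B', '3'],
                    'Col4': ['D', '4', 'd', '6']})
df2 = pd.DataFrame({'ID': [1, 2, 3],
                    'ColA': ['1', 'A', 'B'],
                    'ColB': ['1', 'C', 'B'],
                    'ColC': ['1', 'X', 'Y'],
                    'ColD': ['D', '4', 'd']})

# IDs from df1 whose Col3 passes the filter
mask = df1['Col3'].isin(['C', 'B'])
hits = df1.loc[mask, ['ID']].merge(df2, on='ID', how='left')

# Left-join back onto the full ID list so filtered-out IDs
# survive as all-NaN rows, as in the desired output
df_final = df1[['ID']].merge(hits, on='ID', how='left')
```

This keeps the ID column populated for every row; if you prefer the ID itself to be NaN on filtered-out rows (as in the output shown above), the where/mask answers do exactly that.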

Related

Imputing values into a dataframe based on another dataframe and a condition

Suppose I have the following dataframes:
df1 = pd.DataFrame({'col1':['a','b','c','d'],'col2':[1,2,3,4]})
df2 = pd.DataFrame({'col3':['a','x','a','c','b']})
I wonder how I can look up values in df1, make a new column on df2, and fill it with the matching values from col2; where there is no match I want to impute 0. The result should look like the following:
col3 col4
0 a 1
1 x 0
2 a 1
3 c 3
4 b 2
Use Series.map with Series.fillna:
df2['col2'] = df2['col3'].map(df1.set_index('col1')['col2']).fillna(0).astype(int)
print (df2)
col3 col2
0 a 1
1 x 0
2 a 1
3 c 3
4 b 2
Or use DataFrame.merge, which is better if you need to append multiple columns:
df = df2.merge(df1.rename(columns={'col1':'col3'}), how='left').fillna(0)
print (df)
col3 col2
0 a 1.0
1 x 0.0
2 a 1.0
3 c 3.0
4 b 2.0
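If the floats produced by the merge route are a concern, a small runnable sketch (same sample data as above) restores the integer dtype after imputing:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'], 'col2': [1, 2, 3, 4]})
df2 = pd.DataFrame({'col3': ['a', 'x', 'a', 'c', 'b']})

# Merge on the lookup key, impute 0 for missing matches,
# then cast back to int (fillna leaves the column as float)
out = (df2.merge(df1.rename(columns={'col1': 'col3'}), on='col3', how='left')
          .fillna({'col2': 0})
          .astype({'col2': int}))
```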

Concatenate only new values of first column of one dataframe to an other one

I can't find a proper way to concat only new values of colA. This is quite simple: I need the new elements of column A from DF2 to be added to DF1.
DF1
colA colB colC
a 5 7
b 4 5
c 5 6
DF2
colA colE colF
a 7 e
b d 4
c f g
d h h
e 4 r
I have tried simple code like this, but the output dataframe is not correct:
DF3 = pd.concat([DF1, DF2['colA']])
DF3.drop_duplicates(subset=['colA'], inplace=True, keep='last')
The result is that [a, 5, 7] is dropped and replaced by [a, nan, nan].
What I need is this :
DF3 merged colA
colA colB colC
a 5 7
b 4 5
c 5 6
d
e
Then I fill DF3's missing values manually. I don't need colE or colF in DF3.
You can use pandas.DataFrame.merge:
>>> DF1.merge(DF2, how='outer', on='colA').reindex(DF1.columns, axis=1)
colA colB colC
0 a 5.0 7.0
1 b 4.0 5.0
2 c 5.0 6.0
3 d NaN NaN
4 e NaN NaN
Edit
To remove the NaNs and convert the other values back to int, you can try:
>>> DF1.merge(DF2['colA'], how='outer').fillna(-1, downcast='infer').replace({-1: ''})
colA colB colC
0 a 5 7
1 b 4 5
2 c 5 6
3 d
4 e
# if the -1 sentinel is a concern, convert to nullable "Int64" instead
>>> DF1.astype({'colB': 'Int64', 'colC': 'Int64'}).merge(DF2['colA'], how='outer')
colA colB colC
0 a 5 7
1 b 4 5
2 c 5 6
3 d <NA> <NA>
4 e <NA> <NA>
# You can replace the NaNs with a string as well:
>>> DF1.astype({
'colB': 'Int64',
'colC': 'Int64'
}).merge(DF2['colA'], how='outer').replace({np.nan: ''})
colA colB colC
0 a 5 7
1 b 4 5
2 c 5 6
3 d
4 e
Remove keep='last' to get the default keep='first':
DF3.drop_duplicates(subset=['colA'], inplace=True, keep='last')
to:
DF3.drop_duplicates(subset=['colA'], inplace=True)
Or just outer merge with DF2[['colA']]:
DF1.merge(DF2[['colA']], how='outer')
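As a runnable sketch of the nullable-integer route (recreating the sample frames from the question):

```python
import pandas as pd

DF1 = pd.DataFrame({'colA': ['a', 'b', 'c'],
                    'colB': [5, 4, 5],
                    'colC': [7, 5, 6]})
DF2 = pd.DataFrame({'colA': ['a', 'b', 'c', 'd', 'e'],
                    'colE': ['7', 'd', 'f', 'h', '4'],
                    'colF': ['e', '4', 'g', 'h', 'r']})

# Outer-merge on colA only; the nullable Int64 dtype keeps colB/colC
# integer-typed instead of upcasting to float when NaNs appear
DF3 = (DF1.astype({'colB': 'Int64', 'colC': 'Int64'})
          .merge(DF2[['colA']], on='colA', how='outer'))
```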

How to compare values of certain columns of one dataframe with the values of same set of columns in another dataframe?

I have three dataframes df1, df2, and df3, which are defined as follows
df1 =
A B C
0 1 a a1
1 2 b b2
2 3 c c3
3 4 d d4
4 5 e e5
5 6 f f6
df2 =
A B C
0 1 a X
1 2 b Y
2 3 c Z
df3 =
A B C
3 4 d P
4 5 e Q
5 6 f R
I have defined a Primary Key list PK = ["A","B"].
Now, I take a fourth dataframe df4 as df4 = df1.sample(n=2), which gives something like
df4 =
A B C
4 5 e e5
1 2 b b2
Now, I want to select the rows from df2 and df3 that match the values of the primary keys of df4.
For example, in this case I need to get the row with
index = 4 from df3,
index = 1 from df2.
If possible I need to get a dataframe as follows:
df =
A B C A(df2) B(df2) C(df2) A(df3) B(df3) C(df3)
4 5 e e5 5 e Q
1 2 b b2 2 b Y
Any ideas on how to work this out will be very helpful.
Use two consecutive DataFrame.merge operations, applying DataFrame.add_suffix to the right dataframe, to left-merge df4 with df2 and then df3; finally use DataFrame.fillna to replace the missing values with an empty string:
df = (
df4.merge(df2.add_suffix('(df2)'), left_on=['A', 'B'], right_on=['A(df2)', 'B(df2)'], how='left')
.merge(df3.add_suffix('(df3)'), left_on=['A', 'B'], right_on=['A(df3)', 'B(df3)'], how='left')
.fillna('')
)
Result:
# print(df)
A B C A(df2) B(df2) C(df2) A(df3) B(df3) C(df3)
0 5 e e5 5 e Q
1 2 b b2 2 b Y
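A self-contained version of the double merge, reconstructing the sample frames (with df4 pinned to the two sampled rows above rather than drawn with df1.sample):

```python
import pandas as pd

df2 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c'], 'C': ['X', 'Y', 'Z']})
df3 = pd.DataFrame({'A': [4, 5, 6], 'B': ['d', 'e', 'f'], 'C': ['P', 'Q', 'R']})
df4 = pd.DataFrame({'A': [5, 2], 'B': ['e', 'b'], 'C': ['e5', 'b2']})

# Suffix the right frame's columns so both copies of the key survive,
# then chain two left merges on the primary key ['A', 'B']
df = (df4.merge(df2.add_suffix('(df2)'),
                left_on=['A', 'B'],
                right_on=['A(df2)', 'B(df2)'], how='left')
         .merge(df3.add_suffix('(df3)'),
                left_on=['A', 'B'],
                right_on=['A(df3)', 'B(df3)'], how='left')
         .fillna(''))
```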
Here's how I would do it on the entire data set. If you want to sample first, just update the merge statements at the end by replacing df1 with df4, or take a sample of t.
PK = ["A","B"]
df2 = pd.concat([df2,df2], axis=1)
df2.columns=['A','B','C','A(df2)', 'B(df2)', 'C(df2)']
df2.drop(columns=['C'], inplace=True)
df3 = pd.concat([df3,df3], axis=1)
df3.columns=['A','B','C','A(df3)', 'B(df3)', 'C(df3)']
df3.drop(columns=['C'], inplace=True)
t = df1.merge(df2, on=PK, how='left')
t = t.merge(df3, on=PK, how='left')
Output
A B C A(df2) B(df2) C(df2) A(df3) B(df3) C(df3)
0 1 a a1 1.0 a X NaN NaN NaN
1 2 b b2 2.0 b Y NaN NaN NaN
2 3 c c3 3.0 c Z NaN NaN NaN
3 4 d d4 NaN NaN NaN 4.0 d P
4 5 e e5 NaN NaN NaN 5.0 e Q
5 6 f f6 NaN NaN NaN 6.0 f R

Merge small file into big file and give NaN's for the rows that do not match in python

I would like to merge two data frames - big one and small one. Example of data frames is following:
# small data frame construction
>>> d1 = {'col1': ['A', 'B'], 'col2': [3, 4]}
>>> df1 = pd.DataFrame(data=d1)
>>> df1
col1 col2
0 A 3
1 B 4
# big data frame construction
>>> d2 = {'col1': ['A', 'B', 'C', 'D', 'E'], 'col2': [3, 4, 6, 7, 8]}
>>> df2 = pd.DataFrame(data=d2)
>>> df2
col1 col2
0 A 3
1 B 4
2 C 6
3 D 7
4 E 8
The code I am looking for should produce the following output (a data frame with big data frame shape, column names, and NaNs in rows that were not merged with the small data frame):
col1 col2
0 A 3
1 B 4
2 NA NA
3 NA NA
4 NA NA
The code I have tried:
>>> print(pd.merge(df1, df2, left_index=True, right_index=True, how='right', sort=False))
col1_x col2_x col1_y col2_y
0 A 3.0 A 3
1 B 4.0 B 4
2 NaN NaN C 6
3 NaN NaN D 7
4 NaN NaN E 8
You can pass the suffixes parameter so the overlapping columns from the right frame get a _ appended, then drop those columns with Series.str.endswith, an inverted mask (~), and boolean indexing via loc:
df = pd.merge(df1, df2,
left_index=True,
right_index=True,
how='right',
sort=False,
suffixes=('','_'))
print (df)
col1 col2 col1_ col2_
0 A 3.0 A 3
1 B 4.0 B 4
2 NaN NaN C 6
3 NaN NaN D 7
4 NaN NaN E 8
df = df.loc[:, ~df.columns.str.endswith('_')]
print (df)
col1 col2
0 A 3.0
1 B 4.0
2 NaN NaN
3 NaN NaN
4 NaN NaN
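Since the frames here align purely on the integer index, an alternative sketch with DataFrame.reindex avoids the merge-and-drop dance entirely (assuming the default RangeIndex on both frames):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['A', 'B'], 'col2': [3, 4]})
df2 = pd.DataFrame({'col1': ['A', 'B', 'C', 'D', 'E'],
                    'col2': [3, 4, 6, 7, 8]})

# Reindex the small frame onto the big frame's index;
# rows missing from df1 become all-NaN
out = df1.reindex(df2.index)
```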

Set dataframe column using values from matching indices in another dataframe

I would like to set values in col2 of DF1 using the value held at the matching index of col2 in DF2:
DF1:
col1 col2
index
0 a
1 b
2 c
3 d
4 e
5 f
DF2:
col1 col2
index
2 a x
3 d y
5 f z
DF3:
col1 col2
index
0 a NaN
1 b NaN
2 c x
3 d y
4 e NaN
5 f z
If I just try and set DF1['col2'] = DF2['col2'] then col2 comes out as all NaN values in DF3 - I take it this is because the indices are different. However when I try and use map() to do something like:
DF1.index.to_series().map(DF2['col2'])
then I still get the same NaN column, but I thought it would map the values over where the index matches...
What am I not getting?
You need join or assign:
df = df1.join(df2['col2'])
print (df)
col1 col2
index
0 a NaN
1 b NaN
2 c x
3 d y
4 e NaN
5 f z
Or:
df1 = df1.assign(col2=df2['col2'])
# same as
# df1['col2'] = df2['col2']
print (df1)
col1 col2
index
0 a NaN
1 b NaN
2 c x
3 d y
4 e NaN
5 f z
If nothing matches and all values are NaN, check whether the indices have the same dtype in both dataframes:
print (df1.index.dtype)
print (df2.index.dtype)
If not, then use astype:
df1.index = df1.index.astype(int)
df2.index = df2.index.astype(int)
A bad solution (check index 2: col1 becomes a instead of c):
df = df2.combine_first(df1)
print (df)
col1 col2
index
0 a NaN
1 b NaN
2 a x
3 d y
4 e NaN
5 f z
You can simply use concat, since you are combining based on the index:
df = pd.concat([df1['col1'], df2['col2']],axis = 1)
col1 col2
index
0 a NaN
1 b NaN
2 c x
3 d y
4 e NaN
5 f z
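A runnable sketch of the join approach, recreating the sample frames with their named integer index:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': list('abcdef')},
                   index=pd.RangeIndex(6, name='index'))
df2 = pd.DataFrame({'col1': ['a', 'd', 'f'],
                    'col2': ['x', 'y', 'z']},
                   index=pd.Index([2, 3, 5], name='index'))

# join aligns on the index; rows of df1 with no match in df2 get NaN
df3 = df1.join(df2['col2'])
```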
