Merge two DataFrames but update the original columns - python

I would like to merge two dataframes on 'key'. When the right dataframe contains a key that also appears in the left, I would like the left dataframe's 'A' value to be replaced by the matching value from the right.
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3']})
right = pd.DataFrame({'key': ['K0', 'K2'], 'A': ['new', 'new']})
left.merge(right, on="key", how="outer")
outputs:
key A_x A_y
0 K0 A0 new
1 K1 A1 NaN
2 K2 A2 new
3 K3 A3 NaN
which keeps both columns under the suffixes 'A_x' and 'A_y'. However, the desired output is:
key A
0 K0 new
1 K1 A1
2 K2 new
3 K3 A3
What is needed so that column A is updated for the key values that appear in both dataframes?

One painless way is update, which aligns on the index and overwrites in place:
u = left.set_index('key')
u.update(right.set_index('key'))
u.reset_index()
key A
0 K0 new
1 K1 A1
2 K2 new
3 K3 A3
If the "key" column is unique, you can also concat and drop duplicates:
(pd.concat([left, right])
.drop_duplicates('key', keep='last')
.sort_index()
.reset_index(drop=True))
key A
0 K0 new
1 K1 A1
2 K2 new
3 K3 A3
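A third option (my addition, not from the original answers) stays entirely in merge: do a left merge with custom suffixes so the original column keeps its name, then let the right-hand values fill in wherever the key matched. A sketch, assuming the same left/right frames as above:

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3']})
right = pd.DataFrame({'key': ['K0', 'K2'], 'A': ['new', 'new']})

# Left-merge with an empty suffix for the left column, '_r' for the right,
# then prefer the right-hand value wherever the key matched.
merged = left.merge(right, on='key', how='left', suffixes=('', '_r'))
merged['A'] = merged['A_r'].fillna(merged['A'])
result = merged.drop(columns='A_r')
```

This avoids setting and resetting the index, at the cost of a temporary helper column.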

Related

Multiway merge on keys with duplicates not working

I am trying to do a multi-way left join that doesn't seem to work.
import pandas as pd

# Create dummy dataframes
lt = pd.DataFrame({'key': ['K0', 'K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'AX', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K8'],
                      'B': ['B0', 'B1', 'B2', '32']})
other2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K10'],
                       'C': ['B0', 'B1', 'B2', '99']})
# Create a list of dataframes
dfs = [lt, other, other2]
# Set the index to the 'key' column for all list elements
dfs = [df.set_index('key') for df in dfs]
# Perform the multi-way left join
final = dfs[0].join(dfs[1:], how='left').reset_index()
Returns [final]:
  key    A    B    C
0  K0   A0   B0   B0
1  K0   AX   B0   B0
2  K1   A1   B1   B1
3 K10  NaN  NaN   99
4  K2   A2   B2   B2
5  K3   A3  NaN  NaN
6  K4   A4  NaN  NaN
7  K5   A5  NaN  NaN
8  K8  NaN   32  NaN
Whereas, I need the result [final] as:
key A B C
0 K0 A0 B0 B0
1 K0 AX B0 B0
2 K1 A1 B1 B1
3 K2 A2 B2 B2
4 K3 A3
5 K4 A4
6 K5 A5
What am I doing wrong? I tried doing this between only two dataframes [lt] and [other] and that doesn't work either. I am thinking it's something to do with the list creation step. Any insight would be very helpful. Thanks.
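A workaround (my suggestion, not from this thread): DataFrame.join with a list of frames appears to fall back to an index alignment that does not honour how='left' here, which is why K8 and K10 leak in. Chaining explicit left merges with functools.reduce sidesteps that, since each pd.merge call honours its own how:

```python
from functools import reduce
import pandas as pd

lt = pd.DataFrame({'key': ['K0', 'K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'AX', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K8'],
                      'B': ['B0', 'B1', 'B2', '32']})
other2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K10'],
                       'C': ['B0', 'B1', 'B2', '99']})

# Fold the list with successive left merges: keys present only in the
# right-hand frames (K8, K10) are dropped, keys missing there get NaN.
final = reduce(lambda acc, df: acc.merge(df, on='key', how='left'),
               [lt, other, other2])
```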

Unexpected KeyError When Merging Two Multi-Index Dataframes in Pandas

I have two multi-index dataframes that I want to merge on a common column in the second level. Trying to outer merge the two dfs returns an unexpected KeyError on the final merge key.
I've tested the merge without the multi-index and it works fine. I've also flipped the order of the merge, and the error always occurs on the right_on param. Finally, I've confirmed that I can access the offending key series outside of the merge just fine.
single index merge works fine:
[IN]:
import pandas as pd

df1 = pd.DataFrame({'A1': ['A1', 'A1', 'A2', 'A3'],
                    'B': ['121', '345', '123', '146'],
                    'C': ['K0', 'K1', 'K0', 'K1']})
df2 = pd.DataFrame({'A2': ['A1', 'A3'],
                    'X': ['B0', 'B3'],
                    'Y': ['121', '345'],
                    'Z': ['D0', 'D1']})
fine_merge = pd.merge(df1, df2, how='outer', left_on='A1', right_on='A2')
print(fine_merge)
[OUT]:
A1 B C A2 X Y Z
0 A1 121 K0 A1 B0 121 D0
1 A1 345 K1 A1 B0 121 D0
2 A2 123 K0 NaN NaN NaN NaN
3 A3 146 K1 A3 B3 345 D1
multi-index key works fine:
[IN]:
df1.columns = pd.MultiIndex.from_tuples([('left_header', c) for c in df1.columns])
df2.columns = pd.MultiIndex.from_tuples([('right_header', c) for c in df2.columns])
print(df2['right_header','A2'])
[OUT]:
0 A1
1 A3
Name: (right_header, A2), dtype: object
but multi-index merge returns a KeyError
[IN]:
error_merge = pd.merge(df1,df2, how='outer', left_on=['left_header','A1'], right_on=('right_header','A2'))
print(error_merge)
[OUT]:
KeyError: 'A2'
I am rather confused by this, especially given that if I reverse the merge so that df1 is the right side and right_on=['left_header','A1'], the resulting error is KeyError: 'A1'.
Thanks for the help in advance.
edit: combine, join, concat all yield the following result:
combined
left_header right_header
A1 B C A2 X Y Z
0 A1 121 K0 A1 B0 121.0 D0
1 A1 345 K1 A3 B3 345.0 D1
2 A2 123 K0 NaN NaN NaN NaN
3 A3 146 K1 NaN NaN NaN NaN
You can try the solutions below:
Using combine_first:
df1.combine_first(df2)
Using concat:
pd.concat([df1, df2], axis=1)
A simple join:
df1.join(df2, how='outer')
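A likely cause, not stated in the answers above: with left_on=['left_header','A1'], merge treats the list as two separate keys and looks up a column named 'A1' on its own, hence the KeyError. Wrapping each tuple in a one-element list makes it a single MultiIndex key. A sketch of that fix, under the assumption that the merge key semantics work as described:

```python
import pandas as pd

df1 = pd.DataFrame({'A1': ['A1', 'A1', 'A2', 'A3'],
                    'B': ['121', '345', '123', '146'],
                    'C': ['K0', 'K1', 'K0', 'K1']})
df2 = pd.DataFrame({'A2': ['A1', 'A3'],
                    'X': ['B0', 'B3'],
                    'Y': ['121', '345'],
                    'Z': ['D0', 'D1']})
df1.columns = pd.MultiIndex.from_tuples([('left_header', c) for c in df1.columns])
df2.columns = pd.MultiIndex.from_tuples([('right_header', c) for c in df2.columns])

# A one-element list makes the tuple a single merge key rather than
# two stacked keys named 'left_header' and 'A1'.
fixed = pd.merge(df1, df2, how='outer',
                 left_on=[('left_header', 'A1')],
                 right_on=[('right_header', 'A2')])
```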

pandas dataframe row manipulation

I'm sure that I'm missing something simple, but I haven't been able to figure this one out.
I have a DataFrame in Pandas with multiple rows that have the same keys, but different information. I want to place these rows onto the same row.
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K0', 'K1', 'K2'],
                   'A': ['A0', 'A1', 'A2', 'A3'],
                   'B': ['B0', 'B1', 'B2', 'B3']})
This gives a dataframe with 4 rows and 3 columns, but there is a duplicate value 'K0' in 'key'.
Is there any way to turn this into a dataframe with 3 rows and 5 columns, as shown below?
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                    'A': ['A0', 'A2', 'A3'],
                    'B': ['B0', 'B2', 'B3'],
                    'A_1': ['A1', 'NaN', 'NaN'],
                    'B_1': ['B1', 'NaN', 'NaN']})
Perform a groupby on the cumcount, then concatenate the individual groups together:
gps = []
for i, g in df.groupby(df.groupby('key').cumcount()):
    gps.append(g.drop(columns='key').add_suffix(str(i + 1)).reset_index(drop=True))
r = pd.concat(gps, axis=1).sort_index(axis=1)
r['key'] = df.key.unique()
r
A1 A2 B1 B2 key
0 A0 A1 B0 B1 K0
1 A2 NaN B2 NaN K1
2 A3 NaN B3 NaN K2
You can shorten this somewhat using a list comprehension:
r = pd.concat(
    [g.drop(columns='key').add_suffix(str(i + 1)).reset_index(drop=True)
     for i, g in df.groupby(df.groupby('key').cumcount())],
    axis=1).sort_index(axis=1)
r['key'] = df.key.unique()
r
A1 A2 B1 B2 key
0 A0 A1 B0 B1 K0
1 A2 NaN B2 NaN K1
2 A3 NaN B3 NaN K2
Let's use set_index, groupby, cumcount, and unstack, then flatten the multiindex with map and format:
df_out = df.set_index(['key', df.groupby('key').cumcount()]).unstack()
df_out.columns = df_out.columns.map('{0[0]}_{0[1]}'.format)
df_out.reset_index()
Output:
key A_0 A_1 B_0 B_1
0 K0 A0 A1 B0 B1
1 K1 A2 None B2 None
2 K2 A3 None B3 None
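The same reshape can also be written with pivot (a variant of the unstack answer, not from the original thread), using the cumulative count as the new column level:

```python
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K0', 'K1', 'K2'],
                   'A': ['A0', 'A1', 'A2', 'A3'],
                   'B': ['B0', 'B1', 'B2', 'B3']})

# Number repeated keys 0, 1, ... then pivot that number into the columns.
wide = df.assign(n=df.groupby('key').cumcount()).pivot(index='key', columns='n')
wide.columns = [f'{col}_{n}' for col, n in wide.columns]
wide = wide.reset_index()
```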
I think this alters the layout. Just put the key in the index to access the fields:
df2 = df.set_index([df.key,df.index])
Then
In [248]: df2.loc['K1']
Out[248]:
A B key
2 A2 B2 K1
In [249]: df2.loc['K0']
Out[249]:
A B key
0 A0 B0 K0
1 A1 B1 K0
and iterate over the rows.

merge pandas dataframe with key duplicates

I have 2 dataframes, both with a key column that can contain duplicates, and the dataframes mostly share the same duplicated keys. I'd like to merge them on that key so that matching duplicates are paired up in order. In addition, if one dataframe has more duplicates of a key than the other, I'd like its missing values to be filled with NaN. For example:
import pandas as pd

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K2', 'K2', 'K3'],
                    'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']},
                   columns=['key', 'A'])
df2 = pd.DataFrame({'B': ['B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6'],
                    'key': ['K0', 'K1', 'K2', 'K2', 'K3', 'K3', 'K4']},
                   columns=['key', 'B'])
key A
0 K0 A0
1 K1 A1
2 K2 A2
3 K2 A3
4 K2 A4
5 K3 A5
key B
0 K0 B0
1 K1 B1
2 K2 B2
3 K2 B3
4 K3 B4
5 K3 B5
6 K4 B6
I'm trying to get the following output
key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K2 A3 B3
6 K2 A4 NaN
8 K3 A5 B4
9 K3 NaN B5
10 K4 NaN B6
So basically, I'd like to treat the duplicated K2 keys as K2_1, K2_2, ... and then do the how='outer' merge on the dataframes.
Any ideas how I can accomplish this?
faster again
%%cython
# using cython in a jupyter notebook
# in another cell, run `%load_ext Cython` first
from collections import defaultdict
import numpy as np

def cg(x):
    cnt = defaultdict(lambda: 0)
    for j in x.tolist():
        cnt[j] += 1
        yield cnt[j]

def fastcount(x):
    return [i for i in cg(x)]

df1['cc'] = fastcount(df1.key.values)
df2['cc'] = fastcount(df2.key.values)
df1.merge(df2, how='outer').drop(columns='cc')
faster answer; not scalable
def fastcount(x):
    unq, inv = np.unique(x, return_inverse=True)
    m = np.arange(len(unq))[:, None] == inv
    return (m.cumsum(1) * m).sum(0)

df1['cc'] = fastcount(df1.key.values)
df2['cc'] = fastcount(df2.key.values)
df1.merge(df2, how='outer').drop(columns='cc')
old answer
df1['cc'] = df1.groupby('key').cumcount()
df2['cc'] = df2.groupby('key').cumcount()
df1.merge(df2, how='outer').drop(columns='cc')
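For reference, the cumcount approach spelled out end to end against the question's frames (my expansion of the snippet above):

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K2', 'K2', 'K3'],
                    'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K2', 'K3', 'K3', 'K4'],
                    'B': ['B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6']})

# Number each occurrence of a key so the n-th K2 in df1 only pairs with
# the n-th K2 in df2, then outer-merge on (key, cc).
df1['cc'] = df1.groupby('key').cumcount()
df2['cc'] = df2.groupby('key').cumcount()
out = df1.merge(df2, how='outer').drop(columns='cc')
```

The third K2 keeps NaN in B, and the extra K3/K4 rows from df2 keep NaN in A, matching the desired output.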
df1.set_index('key', inplace=True)
df2.set_index('key', inplace=True)
merged_df = pd.merge(df1, df2, left_index=True, right_index=True, how='inner')
merged_df.reset_index('key', drop=False, inplace=True)

Using Merge on a column and Index in Pandas

I have two separate dataframes that share a project number. In type_df, the project number is the index. In time_df, the project number is a column. I would like to count the number of rows in type_df that have a Project Type of 2. I am trying to do this with pandas.merge(). It works great when using both columns, but not indices. I'm not sure how to reference the index and if merge is even the right way to do this.
import pandas as pd
type_df = pd.DataFrame(data=[['Type 1'], ['Type 2']],
                       columns=['Project Type'],
                       index=['Project2', 'Project1'])
time_df = pd.DataFrame(data=[['Project1', 13], ['Project1', 12],
                             ['Project2', 41]],
                       columns=['Project', 'Time'])
merged = pd.merge(time_df, type_df, on=[index, 'Project'])
print merged[merged['Project Type'] == 'Type 2']['Project Type'].count()
Error:
Name 'Index' is not defined.
Desired Output:
2
If you want to use an index in your merge you have to specify left_index=True or right_index=True, and then use left_on or right_on. For you it should look something like this:
merged = pd.merge(type_df, time_df, left_index=True, right_on='Project')
Another solution is to use DataFrame.join; note that on refers to a column of the calling dataframe, so the frame with the 'Project' column must be the caller:
df3 = time_df.join(type_df, on='Project')
As of pandas 0.23.0, the on, left_on, and right_on parameters may refer to either column names or index level names:
left_index = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1')
left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'key2': ['K0', 'K1', 'K0', 'K1']},
                    index=left_index)
right_index = pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1')
right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3'],
                      'key2': ['K0', 'K0', 'K0', 'K1']},
                     index=right_index)
print (left)
A B key2
key1
K0 A0 B0 K0
K0 A1 B1 K1
K1 A2 B2 K0
K2 A3 B3 K1
print (right)
C D key2
key1
K0 C0 D0 K0
K1 C1 D1 K0
K2 C2 D2 K0
K2 C3 D3 K1
df = left.merge(right, on=['key1', 'key2'])
print (df)
A B key2 C D
key1
K0 A0 B0 K0 C0 D0
K1 A2 B2 K0 C1 D1
K2 A3 B3 K1 C3 D3
One simple approach is to give both dataframes the same column to merge on.
In this case, just make a 'Project' column for type_df, then merge on that:
type_df['Project'] = type_df.index.values
merged = pd.merge(time_df, type_df, on='Project', how='inner')
merged
# Project Time Project Type
#0 Project1 13 Type 2
#1 Project1 12 Type 2
#2 Project2 41 Type 1
print merged[merged['Project Type'] == 'Type 2']['Project Type'].count()
2
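A merge-free variant (my suggestion, not from the answers above): map the project type onto time_df via type_df's index and count directly:

```python
import pandas as pd

type_df = pd.DataFrame(data=[['Type 1'], ['Type 2']],
                       columns=['Project Type'],
                       index=['Project2', 'Project1'])
time_df = pd.DataFrame(data=[['Project1', 13], ['Project1', 12],
                             ['Project2', 41]],
                       columns=['Project', 'Time'])

# Series.map looks each project number up in type_df's index,
# so no merge or helper column is needed.
types = time_df['Project'].map(type_df['Project Type'])
count = (types == 'Type 2').sum()
print(count)  # 2
```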
