Merge on key and keep values of first dataframe - python

I have two dataframes:
df1
key value
A 1
B 2
C 2
D 3
df2
key value
C 3
D 3
E 5
F 7
I would like to merge this dataframes by their key and get a dataframe which looks like this one. So, I want to get only one column (no new column with suffixes) and remove the value of df2 if the values do not fit together.
df_merged
key value
A 1
B 2
C 2
D 3
E 5
F 7
How can I do this? Should I rather take join or concatenate? Thanks a lot!

Use concat with DataFrame.drop_duplicates by column key:
df = pd.concat([df1, df2], ignore_index=True).drop_duplicates('key')
print (df)
key value
0 A 1
1 B 2
2 C 2
3 D 3
6 E 5
7 F 7

Just adding to #jezrael's answer, you could also use groupby with first:
>>> pd.concat([df1, df2], ignore_index=True).groupby('key', as_index=False).first()
key value
0 A 1
1 B 2
2 C 2
3 D 3
4 E 5
5 F 7
>>>

Related

How can a duplicate row be dropped with some condition [duplicate]

This question already has answers here:
Get the row(s) which have the max value in groups using groupby
(15 answers)
Closed 9 months ago.
Simple DataFrame:
df = pd.DataFrame({'A': [1,1,2,2], 'B': [0,1,2,3], 'C': ['a','b','c','d']})
df
A B C
0 1 0 a
1 1 1 b
2 2 2 c
3 2 3 d
I wish for every value (groupby) of column A, to get the value of column C, for which column B is maximum. For example for group 1 of column A, the maximum of column B is 1, so I want the value "b" of column C:
A C
0 1 b
1 2 d
No need to assume column B is sorted, performance is of top priority, then elegance.
Check with sort_values +drop_duplicates
df.sort_values('B').drop_duplicates(['A'],keep='last')
Out[127]:
A B C
1 1 1 b
3 2 3 d
df.groupby('A').apply(lambda x: x.loc[x['B'].idxmax(), 'C'])
# A
#1 b
#2 d
Use idxmax to find the index where B is maximal, then select column C within that group (using a lambda-function
Here's a little fun with groupby and nlargest:
(df.set_index('C')
.groupby('A')['B']
.nlargest(1)
.index
.to_frame()
.reset_index(drop=True))
A C
0 1 b
1 2 d
Or, sort_values, groupby, and last:
df.sort_values('B').groupby('A')['C'].last().reset_index()
A C
0 1 b
1 2 d
Similar solution to #Jondiedoop, but avoids the apply:
u = df.groupby('A')['B'].idxmax()
df.loc[u, ['A', 'C']].reset_index(drop=1)
A C
0 1 b
1 2 d

Add all columns form one dataframe to another without joining on a key/index

Having two dataframes df1 and df2 (same number of rows) how can we, very simply, take all the columns from df2 and add them to df1? Using join, we are joining them on the index or a given column, but assuming their index's are completely different and they have no columns in common. Is that doable (without the obvious way of looping over each column in df2and add them as new to df1)?
EDIT: added an example.
Note; no index, column names are mentioned since it should not matter (thats is the "problem").
df1= [[1,3,2,
[11,20,33]]
df2 = [["bird",np.nan,37,np.sqrt(2)]
["dog",0.123,3.14,0]]
pd.some_operation(df1,df2)
#[[1,3,2,"bird",np.nan,37,np.sqrt(2)]
#[11,20,33,"dog",0.123,3.14,0]]
Samples:
df1 = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
}, index = list('QRSTUW'))
df2 = pd.DataFrame({
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
}, index = list('KLMNOP'))
Pandas always use index values if use join or concat by axis=1, so for correct alignement is necessary create same index values:
df = df1.join(df2.set_index(df1.index))
df = pd.concat([df1, df2.set_index(df1.index)], axis=1)
print (df)
A B C D E F
Q a 4 7 1 5 a
R b 5 8 3 3 a
S c 4 9 5 6 a
T d 5 4 7 9 b
U e 5 2 1 2 b
W f 4 3 0 4 b
Or create default index in both DataFrames:
df = df1.reset_index(drop=True).join(df2.reset_index(drop=True))
df = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b

Pandas Copy columns from one data frame to another with different name

I have to copy columns from one DataFrame A to another DataFrame B. The column names in A and B do not match.
What is the best way to do it? There are several columns like this. Do I need to write for each column like B["SO"] = A["Sales Order"] etc.
i would use pd.concat
combined_df = pd.concat([df1, df2[['column_a', 'column_b']]], axis=1)
also gives you the power to concat different size dateframes , outer join etc.
Use:
df1 = pd.DataFrame({
'SO':list('abcdef'),
'RI':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
})
print (df1)
SO RI C
0 a 4 7
1 b 5 8
2 c 4 9
3 d 5 4
4 e 5 2
5 f 4 3
df2 = pd.DataFrame({
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
print (df2)
D E F
0 1 5 a
1 3 3 a
2 5 6 a
3 7 9 b
4 1 2 b
5 0 4 b
Create dictionary for rename, select columns matched, rename by dict and DataFrame.join to original - DataFrames matched by index values:
d = {'SO':'Sales Order',
'RI':'Retail Invoices'}
df11 = df1[d.keys()].rename(columns=d)
print (df11)
Sales Order Retail Invoices
0 a 4
1 b 5
2 c 4
3 d 5
4 e 5
5 f 4
df = df2.join(df11)
print (df)
D E F Sales Order Retail Invoices
0 1 5 a a 4
1 3 3 a b 5
2 5 6 a c 4
3 7 9 b d 5
4 1 2 b e 5
5 0 4 b f 4
Make a dictionary of abbreviations. And try this code.
Ex:
full_form_dict = {'SO':'Sales Order',
'RI':'Retail Invoices',}
A_col = list(A.columns)
B_col = [v for k,v in full_form_dict.items() if k in A_col]
# to loop over A_col
# B_col = [v for col in A_col for k,v in full_form_dict.items() if k == col]

How to combine two Series with the same index in python?

I have two Series (df1 and df2) of equal length, which need to be combined into one DataFrame column as follows. Each index has only one value or no values but never two values, so there are no duplicates (e.g. if df1 has a value 'A' at index 0, then df2 is empty at index 0, and vice versa).
df1 = c1 df2 = c2
0 A 0
1 B 1
2 2 C
3 D 3
4 E 4
5 5 F
6 6
7 G 7
The result I want is this:
0 A
1 B
2 C
3 D
4 E
5 F
6
7 G
I have tried .concat, .append and .union, but these do not produce the desired result. What is the correct approach then?
You can try so:
df1['new'] = df1['c1'] + df2['c2']
For an in-place solution, I recommend pd.Series.replace:
df1['c1'].replace('', df2['c2'], inplace=True)
print(df1)
c1
0 A
1 B
2 C
3 D
4 E
5 F
6
7 G

Adding a column in dataframes based on similar columns in them

I am trying to get an output where I wish to add column d in d1 and d2 where a b c are same (like groupby).
For example
d1 = pd.DataFrame([[1,2,3,4]],columns=['a','b','c','d'])
d2 = pd.DataFrame([[1,2,3,4],[2,3,4,5]],columns=['a','b','c','d'])
then I'd like to get an output as
a b c d
0 1 2 3 8
1 2 3 4 5
Merging the two data frames and adding the resultant column d where a b c are same.
d1.add(d2) or radd gives me an aggregate of all columns
The solution should be a DataFrame which can be added again to another similarly.
Any help is appreciated.
You can use set_index first:
print (d2.set_index(['a','b','c'])
.add(d1.set_index(['a','b','c']), fill_value=0)
.astype(int)
.reset_index())
a b c d
0 1 2 3 8
1 2 3 4 5
df = pd.concat([d1, d2])
df.drop_duplicates()
a b c d
0 1 2 3 4
1 2 3 4 5

Categories