How To Merge Two Data Frames in Pandas Python [duplicate] - python

This question already has answers here:
How do I combine two dataframes?
(8 answers)
Pandas Merging 101
(8 answers)
Closed 10 months ago.
How To Merge/Concat Two Data Frames
I want to merge two dataframes: the first is a one-column dataframe with datetime64 dtype and the second is a one-column dataframe with float dtype. This is what I have tried:
df1 = pd.DataFrame(df, columns = ['MemStartDate'])
df4 = pd.DataFrame(df, columns = ['TotalPrice'])
df_merge = pd.merge(df1,df2,left_on='MemStartDate',right_on='TotalPrice')
Error: You are trying to merge on datetime64[ns] and float64 columns. If you wish to proceed you should use pd.concat
But how can I do that ?

You can try this:
df_merge = pd.concat([df1, df2], axis=1)

The best option is to use pd.concat, but you can also try DataFrame.join. For more information, go through Merge, join, concatenate and compare.
df_merge = df1.join(df2)
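For example, a minimal sketch with toy one-column frames like the asker's (join aligns on the index, not on column values, so it works here because both frames share the default 0, 1, 2 index):
import pandas as pd

df1 = pd.DataFrame({'MemStartDate': pd.to_datetime(['2007-07-13', '2006-01-13', '2010-08-13'])})
df2 = pd.DataFrame({'TotalPrice': [50.5, 10.4, 3.5]})

df_merge = df1.join(df2)   # same result here as pd.concat([df1, df2], axis=1)
print(df_merge)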

Let us consider the following situation:
import pandas as pd
# Create a dataframe with one column of type datetime64 and one of type float64
dictionary = {'MemStartDate':['2007-07-13', '2006-01-13', '2010-08-13'],
'TotalPrice':[50.5,10.4,3.5]}
df= pd.DataFrame(dictionary)
df['MemStartDate'] = pd.to_datetime(df['MemStartDate'])  # dtype: datetime64[ns]
df1 = pd.DataFrame(df, columns = ['MemStartDate'])
df4 = pd.DataFrame(df, columns = ['TotalPrice'])
df.TotalPrice # dtype: float64
You then have df1 and df4:
df1
Out:
MemStartDate
0 2007-07-13
1 2006-01-13
2 2010-08-13
df4
Out:
TotalPrice
0 50.5
1 10.4
2 3.5
If you want to concat df1 and df4, it means that you want to concatenate pandas objects along a particular axis, with optional set logic along the other axes (see pandas.concat — pandas 1.4.2 documentation). In practice:
df_concatenated = pd.concat([df1, df4], axis=1)
df_concatenated
The new resulting dataframe df_concatenated is this:
Out:
MemStartDate TotalPrice
0 2007-07-13 50.5
1 2006-01-13 10.4
2 2010-08-13 3.5
The axis argument decides which axis to concatenate along. With axis=1 you have concatenated the second dataframe along the columns of the first. You can try axis=0:
df_concatenated = pd.concat([df1, df4], axis=0)
df_concatenated
The output is:
Out:
MemStartDate TotalPrice
0 2007-07-13 NaN
1 2006-01-13 NaN
2 2010-08-13 NaN
0 NaT 50.5
1 NaT 10.4
2 NaT 3.5
Now you have added the second dataframe along rows of the first dataframe.
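With axis=0 the original row labels are carried over, which is why the index above repeats 0, 1, 2. If you would rather have a fresh index, you can pass ignore_index=True; a small sketch with the same df1 and df4 as above:
df_concatenated = pd.concat([df1, df4], axis=0, ignore_index=True)
df_concatenated
The rows are the same as above, only renumbered from 0 to 5.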
On the other hand, merge is used to join dataframes when they share one or more columns. It is useful when you do not want to store the same contents repeatedly across dataframes. For example:
# Create two dataframes
dictionary = {'MemStartDate':['2007-07-13', '2006-01-13', '2010-08-13'],
'TotalPrice':[50.5,10.4,3.5]}
dictionary_1 = {'MemStartDate':['2007-07-13', '2006-01-13', '2010-08-13', '2010-08-14'],
'Shop':['Shop_1','Shop_2','Shop_3','Shop_4']}
df= pd.DataFrame(dictionary)
df_1 = pd.DataFrame(dictionary_1)
So df and df_1 are:
df
Out:
MemStartDate TotalPrice
0 2007-07-13 50.5
1 2006-01-13 10.4
2 2010-08-13 3.5
and
df_1
Out:
MemStartDate Shop
0 2007-07-13 Shop_1
1 2006-01-13 Shop_2
2 2010-08-13 Shop_3
3 2010-08-14 Shop_4
You can merge them in this way:
df_merged = pd.merge(df,df_1, on='MemStartDate', how='outer')
df_merged
Out:
MemStartDate TotalPrice Shop
0 2007-07-13 50.5 Shop_1
1 2006-01-13 10.4 Shop_2
2 2010-08-13 3.5 Shop_3
3 2010-08-14 NaN Shop_4
In the new dataframe df_merged, you keep the column that the old dataframes df and df_1 have in common (MemStartDate) and add the two columns that differ between them (TotalPrice and Shop).
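For comparison, how='inner' would keep only the dates present in both frames, so the 2010-08-14 row would be dropped; a quick sketch with the same df and df_1:
df_merged_inner = pd.merge(df, df_1, on='MemStartDate', how='inner')
df_merged_inner
Out:
MemStartDate TotalPrice Shop
0 2007-07-13 50.5 Shop_1
1 2006-01-13 10.4 Shop_2
2 2010-08-13 3.5 Shop_3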
A couple of other illustrative examples of merging dataframes in pandas:
Example 1. Merging two dataframes on a key column that is identical in both:
left = pd.DataFrame(
{
"key": ["K0", "K1", "K2", "K3"],
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
}
)
left
right = pd.DataFrame(
{
"key": ["K0", "K1", "K2", "K3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
}
)
right
result = pd.merge(left, right, on="key")
result
Out:
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3
Example 2. Merging two dataframes to get all combinations of matching values:
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
'value': [5, 6, 7, 8]})
result = pd.merge(df1,df2, left_on='lkey', right_on='rkey')
result
Out:
lkey value_x rkey value_y
0 foo 1 foo 5
1 foo 1 foo 8
2 foo 5 foo 5
3 foo 5 foo 8
4 bar 2 bar 6
5 baz 3 baz 7
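The value_x and value_y names come from the default suffixes=('_x', '_y') that merge applies to overlapping column names; you can pick your own, for example (same df1 and df2 as above):
result = pd.merge(df1, df2, left_on='lkey', right_on='rkey', suffixes=('_left', '_right'))
result.columns
Out:
Index(['lkey', 'value_left', 'rkey', 'value_right'], dtype='object')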
In this case too you can check pandas.DataFrame.merge — pandas 1.4.2 documentation (where the second example comes from), and Merge, join, concatenate and compare shows other ways to manipulate your dataframes (it is where the first example comes from).
To sum up, you can get an intuition for what pd.concat() and pd.merge() do from the everyday meaning of their names:
Concatenate: to link together in a series or chain
Merge: to cause to combine, unite, or coalesce
And to come back to your error:
Error: You are trying to merge on datetime64[ns] and float64 columns. If you wish to proceed you should use pd.concat
It is telling you that the columns you are merging on have different data types. Pandas assumes you are trying to do something that is "pd.concat's job", so it suggests pd.concat instead.
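As a side note, if the two frames really did share a key and only its dtype differed (say, integers on one side and strings on the other), the usual fix would be to cast one side before merging rather than switching to pd.concat. A small sketch with made-up frames, just to illustrate:
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'a': ['x', 'y', 'z']})       # 'key' is int64
right = pd.DataFrame({'key': ['1', '2', '3'], 'b': [10, 20, 30]})   # 'key' is object (strings)

# Cast the string key to int so both sides have the same dtype, then merge as usual
merged = pd.merge(left, right.assign(key=right['key'].astype(int)), on='key')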

Related

Outer merge between pandas dataframes and imputing NA with preceding row

I have two dataframes containing the same columns:
df1 = pd.DataFrame({'a': [1,2,3,4,5],
'b': [2,3,4,5,6]})
df2 = pd.DataFrame({'a': [1,3,4],
'b': [2,4,5]})
I want df2 to have the same number of rows as df1. Any values of a from df1 that are not present in df2 should be added, and the corresponding values of b should be taken from the row before.
In other words, I want df2 to look like this:
df2 = pd.DataFrame({'a': [1,2,3,4,5],
'b': [2,2,4,5,5]})
EDIT: I'm looking for an answer that is independent of the number of columns
Use DataFrame.merge with only the a column from df1, then forward fill to replace the missing values:
df = df1[['a']].merge(df2, how='left', on='a').ffill()
print (df)
a b
0 1 2.0
1 2 2.0
2 3 4.0
3 4 5.0
4 5 5.0
Or use merge_asof:
df = pd.merge_asof(df1[['a']], df2, on='a')
print (df)
a b
0 1 2
1 2 2
2 3 4
3 4 5
4 5 5
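One caveat on the merge_asof route: it expects both frames to be sorted on the key, and by default it matches backward, i.e. each a in df1 takes b from the last df2 row whose a is less than or equal to it, which is exactly the forward-fill behaviour wanted here. Passing direction='forward' would instead take b from the next matching row (and give NaN for a=5); a sketch:
df_fwd = pd.merge_asof(df1[['a']], df2, on='a', direction='forward')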

Pandas - merge two dataframes based off of intersection of columns

I am new to dataframe manipulation. I've been playing around with df.merge, df.join, pd.concat and I've been getting frequent errors while being unable to merge without duplicates.
I have two representative dataframes I want to merge.
df1 = pd.DataFrame({'1990' : 1, '1991': 2, '1992': 3}, index = ['a','b','c'])
df2 = pd.DataFrame({'1989':0,'1990' : 1, '1991': 2, '1992': 3, '1993': 4}, index = ['d'])
I want to merge them by the intersection of the columns of the two dataframes while adding the row at the same time. Is there a way to use a dataframe method to do this?
The final product should look like:
1990 1991 1992
a 1 2 3
b 1 2 3
c 1 2 3
d 1 2 3
Use concat with inner join:
df = pd.concat([df1, df2], join='inner')
print (df)
1990 1991 1992
a 1 2 3
b 1 2 3
c 1 2 3
d 1 2 3
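join='inner' keeps only the columns the two frames have in common, so 1989 and 1993 from df2 are dropped. The same idea spelled out with an explicit column intersection, as a sketch:
common = df1.columns.intersection(df2.columns)   # Index(['1990', '1991', '1992'])
df = pd.concat([df1[common], df2[common]])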

Pandas merge on part of two columns

I have two dataframes with a common column called 'upc' as such:
df1:
upc
23456793749
78907809834
35894796324
67382808404
93743008374
df2:
upc
4567937
9078098
8947963
3828084
7430083
Notice that df2 'upc' values are the innermost 7 values of df1 'upc' values.
Note that both df1 and df2 have other columns not shown above.
What I want to do is do an inner merge on 'upc' but only on the innermost 7 values. How can I achieve this?
1) Create both dataframes and convert the codes to string type.
2) pd.merge the two frames, using the left_on keyword to access the inner 7 characters of your 'upc' series:
df1 = pd.DataFrame(data=[
23456793749,
78907809834,
35894796324,
67382808404,
93743008374,], columns = ['upc1'])
df1 = df1.astype(str)
df2 = pd.DataFrame(data=[
4567937,
9078098,
8947963,
3828084,
7430083,], columns = ['upc2'])
df2 = df2.astype(str)
pd.merge(df1, df2, left_on=df1['upc1'].astype(str).str[2:-2], right_on='upc2', how='inner')
Out[5]:
upc1 upc2
0 23456793749 4567937
1 78907809834 9078098
2 35894796324 8947963
3 67382808404 3828084
4 93743008374 7430083
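Note that the str[2:-2] slice assumes the codes are fixed width (11 digits in df1, 7 in df2); if the widths can vary, slicing by position silently produces wrong keys. A quick guard for that assumption, as a sketch:
assert df1['upc1'].str.len().eq(11).all(), "expected 11-digit codes in df1"
assert df2['upc2'].str.len().eq(7).all(), "expected 7-digit codes in df2"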
Using str.extract, match each item in df1 against the items in df2, then use the result as the key to merge with df2:
df1['keyfordf2']=df1.astype(str).upc.str.extract(r'({})'.format('|'.join(df2.upc.astype(str).tolist())),expand=True).fillna(False)
df1.merge(df2.astype(str),left_on='keyfordf2',right_on='upc')
Out[273]:
upc_x keyfordf2 upc_y
0 23456793749 4567937 4567937
1 78907809834 9078098 9078098
2 35894796324 8947963 8947963
3 67382808404 3828084 3828084
4 93743008374 7430083 7430083
You could make a new column in df1 and merge on that.
import pandas as pd
df1= pd.DataFrame({'upc': [ 23456793749, 78907809834, 35894796324, 67382808404, 93743008374]})
df2= pd.DataFrame({'upc': [ 4567937, 9078098, 8947963, 3828084, 7430083]})
df1['upc_old'] = df1['upc'] #in case you still need the old (longer) upc column
df1['upc'] = df1['upc'].astype(str).str[2:-2].astype(int)
merged_df = pd.merge(df1, df2, on='upc')

Merge multiple data frames with different dimensions using Pandas [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I have the following data frames (in reality there are more than 3).
import pandas as pd
df1 = pd.DataFrame({'head1': ['foo', 'bix', 'bar'],'val': [11, 22, 32]})
df2 = pd.DataFrame({'head2': ['foo', 'xoo', 'bar','qux'],'val': [1, 2, 3,10]})
df3 = pd.DataFrame({'head3': ['xoo', 'bar',],'val': [20, 100]})
# Note that the value in column 'head' is always unique
What I want to do is merge them based on the head column. Whenever a head value does not exist in one of the data frames, it should be assigned NA.
In the end it'll look like this:
head1 head2 head3
-------------------------------
foo 11 1 NA
bix 22 NA NA
bar 32 3 100
xoo NA 2 20
qux NA 10 NA
How can I achieve that using Pandas?
You can use pandas.concat, selecting axis=1, to concatenate your multiple DataFrames.
Note however that I've first set the index of df1, df2 and df3 to use the labels (foo, bar, etc.) rather than the default integers.
import pandas as pd
df1 = pd.DataFrame({'head1': ['foo', 'bix', 'bar'],'val': [11, 22, 32]})
df2 = pd.DataFrame({'head2': ['foo', 'xoo', 'bar','qux'],'val': [1, 2, 3,10]})
df3 = pd.DataFrame({'head3': ['xoo', 'bar',],'val': [20, 100]})
df1 = df1.set_index('head1')
df2 = df2.set_index('head2')
df3 = df3.set_index('head3')
df = pd.concat([df1, df2, df3], axis = 1)
columns = ['head1', 'head2', 'head3']
df.columns = columns
print(df)
head1 head2 head3
bar 32 3 100
bix 22 NaN NaN
foo 11 1 NaN
qux NaN 10 NaN
xoo NaN 2 20
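If you prefer to stay with merge, an equivalent approach (sketched here, not from the original answer) is to rename each key column to a common name, give each value column a distinct name, and chain outer merges with functools.reduce:
from functools import reduce
import pandas as pd

df1 = pd.DataFrame({'head1': ['foo', 'bix', 'bar'], 'val': [11, 22, 32]})
df2 = pd.DataFrame({'head2': ['foo', 'xoo', 'bar', 'qux'], 'val': [1, 2, 3, 10]})
df3 = pd.DataFrame({'head3': ['xoo', 'bar'], 'val': [20, 100]})

# Rename the key column to 'head' and the value column to the frame-specific name
frames = [d.rename(columns={key: 'head', 'val': key})
          for d, key in [(df1, 'head1'), (df2, 'head2'), (df3, 'head3')]]

# Chain outer merges on the shared 'head' key; missing combinations become NaN
df = reduce(lambda left, right: pd.merge(left, right, on='head', how='outer'), frames)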

Merge after groupby

I'm having trouble using pd.merge after groupby. Here's my hypothetical:
import pandas as pd
from pandas import DataFrame
import numpy as np
df1 = DataFrame({'key': [1,1,2,2,3,3],
'var11': np.random.randn(6),
'var12': np.random.randn(6)})
df2 = DataFrame({'key': [1,2,3],
'var21': np.random.randn(3),
'var22': np.random.randn(3)})
#group var11 in df1 by key
grouped = df1['var11'].groupby(df1['key'])
# calculate the mean of var11 by key
grouped = grouped.mean()
print(grouped)
key
1 1.399430
2 0.568216
3 -0.612843
dtype: float64
print(grouped.index)
Int64Index([1, 2, 3], dtype='int64')
print(df2)
key var21 var22
0 1 -0.381078 0.224325
1 2 0.836719 -0.565498
2 3 0.323412 -1.616901
df2 = pd.merge(df2, grouped, left_on = 'key', right_index = True)
At this point, I get IndexError: list index out of range.
When using groupby, the grouping variable ('key' in this example) becomes the index for the resultant series, which is why I specify 'right_index = True'. I've tried other syntax without success. Any advice?
I think you should just do this:
In [140]:
df2 = pd.merge(df2,
pd.DataFrame(grouped, columns=['mean']),
left_on='key',
right_index=True)
print(df2)
key var21 var22 mean
0 1 0.324476 0.701254 0.400313
1 2 -1.270500 0.055383 -0.293691
2 3 0.804864 0.566747 0.628787
[3 rows x 4 columns]
The reason it didn't work is that grouped is a Series, not a DataFrame.
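A more current way to write the same fix, sketched: either give the grouped Series a name and merge it on its index (DataFrame.merge accepts a named Series), or use as_index=False so 'key' stays an ordinary column:
means = df1.groupby('key')['var11'].mean().rename('var11_mean')
df2 = df2.merge(means, left_on='key', right_index=True)

# or equivalently, keep 'key' as a column and merge on it:
# df2 = df2.merge(df1.groupby('key', as_index=False)['var11'].mean())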
