I am trying to merge two dataframes that have three columns with the same name.
df1 = ['id','first','last','present']
df2 = ['id','first','last','age','date','location']
I want to join on these three common columns. I tried the following:
cols_to_use = df2.columns.difference(df1.columns)
result = pd.merge(df1, df2[cols_to_use], left_index=True, right_index=True, how='outer')
to get
['id','first','last','present','age','date','location']
Can this be done in a single pd.merge step?
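One option is to pass the shared column names to `on`, so the key columns appear only once in the result. This is a minimal sketch; the sample values are made up and only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names are taken from the question.
df1 = pd.DataFrame({
    "id": [1, 2],
    "first": ["Ann", "Bob"],
    "last": ["Lee", "Ray"],
    "present": [True, False],
})
df2 = pd.DataFrame({
    "id": [1, 3],
    "first": ["Ann", "Cy"],
    "last": ["Lee", "Fox"],
    "age": [30, 41],
    "date": ["2020-01-01", "2020-02-01"],
    "location": ["NY", "LA"],
})

# Join on all three shared columns in one step; each key column is kept once.
result = pd.merge(df1, df2, on=["id", "first", "last"], how="outer")
print(result.columns.tolist())
```

With `how='outer'`, rows found in only one of the frames are kept and their missing columns are filled with NaN.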
Related
I have a list of dataframes, each with its own column name to join on. For example:
dfs = [df1, df2, df3, df4]
col_join = ["col1", "col2", "col3"]
I have seen answers using the reduce function in Python.
import pandas as pd
from functools import reduce
reduce(lambda x, y: pd.merge(x, y, on=["Col1"], how="outer"), dfs)
What I want to achieve is the following:
df1 Columns:
Data1 Dim1 Dim2 Dim3
df2 Columns:
Example1 Example2 Example3
df3 Columns:
Other1 Other2 Other3
df1 to df2 is joined by Dim1 to Example1.
df1 to df3 is joined by Dim2 to Other1.
The list continues up to df(n), where n can be as large as 20, with each dataframe joined to df1 on a different column name.
My idea is to write a function that takes the original df1 and the list of remaining dataframes df2, df3, df4 ... dfn.
The other argument is the list of merging columns. For the example above it would be left_on=["Dim1"], right_on=["Example1"].
The next step would be joining df1 (already including df2 from the previous join) to df3 on Dim2 and Other1.
Each dataframe will be joined to df1 on a different column, which may or may not have the same name as the column in df1. That is why left_on and right_on should be used.
How can I account for the merge columns changing at each join inside the reduce function?
Thank you in advance.
This might work (untested):
result = df1
for df, col in zip(dfs[1:], col_join):
    result = pd.merge(result, df, on=[col], how='outer')
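If the key names differ on each side, one way to stay with reduce is to fold over (frame, left key, right key) triples and pass df1 as the initial value. This is only a sketch: the sample values, the `joins` list, and the `merge_step` helper are hypothetical names, not something from the question.

```python
from functools import reduce

import pandas as pd

# Hypothetical sample frames; only the column names are taken from the question.
df1 = pd.DataFrame({"Data1": [10, 20], "Dim1": ["a", "b"], "Dim2": ["x", "y"], "Dim3": [1, 2]})
df2 = pd.DataFrame({"Example1": ["a", "b"], "Example2": [1.0, 2.0], "Example3": [True, False]})
df3 = pd.DataFrame({"Other1": ["x", "y"], "Other2": [5, 6], "Other3": ["p", "q"]})

# One (right frame, key in df1, key in right frame) triple per join.
joins = [
    (df2, "Dim1", "Example1"),
    (df3, "Dim2", "Other1"),
]

def merge_step(left, join):
    """Merge the accumulated frame with the next frame on its own key pair."""
    right, left_key, right_key = join
    return pd.merge(left, right, left_on=left_key, right_on=right_key, how="outer")

# df1 is the initial accumulator; each step merges in one more frame.
result = reduce(merge_step, joins, df1)
```

Because the keys have different names, both key columns (for example Dim1 and Example1) remain in the result and can be dropped afterwards if they are not needed.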
I have 2 dataframes: df1 and df2. I would like to merge them using the link column in df2. Each value in link is a comma-separated list of column=value pairs that must match a row in df1:
df1 = pd.DataFrame({'p':[1,2,3,4], 'a':[1,2,2,2],'b':['z','z','z','z'],'c':[3,3,4,4],'d':[5,5,5,6]})
df2 = pd.DataFrame({'e':[11,22,33,44], 'link':['a=1,c=3','a=2,c=3','a=2,c=4,d=5','a=2,c=4']})
The result should be a dataframe like this, where column e from df2 is merged into the matching rows of df1:
df_res = pd.DataFrame({'p':[1,2,3,3,4], 'a':[1,2,2,2,2],'b':['z','z','z','z','z'],'c':[3,3,4,4,4],'d':[5,5,5,5,6],'e':[11,22,33,44,44]})
How can this be done in pandas?
df1["e"] = df2["e"]
frames = [df1, df2]
result = pd.concat(frames)
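Another possible approach is to parse each link string into column=value conditions, select the df1 rows that satisfy all of them, and attach the corresponding e value. This is a sketch that assumes every link entry has the form col=value separated by commas and that comparing values as strings is acceptable.

```python
import pandas as pd

df1 = pd.DataFrame({'p': [1, 2, 3, 4], 'a': [1, 2, 2, 2], 'b': ['z', 'z', 'z', 'z'],
                    'c': [3, 3, 4, 4], 'd': [5, 5, 5, 6]})
df2 = pd.DataFrame({'e': [11, 22, 33, 44],
                    'link': ['a=1,c=3', 'a=2,c=3', 'a=2,c=4,d=5', 'a=2,c=4']})

pieces = []
for e_value, link in zip(df2['e'], df2['link']):
    # 'a=2,c=4' -> {'a': '2', 'c': '4'}
    conditions = dict(pair.split('=', 1) for pair in link.split(','))
    # Start with every row selected, then narrow down one condition at a time.
    mask = pd.Series(True, index=df1.index)
    for col, val in conditions.items():
        mask &= (df1[col].astype(str) == val)
    # Keep the matching df1 rows and attach this link's e value.
    pieces.append(df1[mask].assign(e=e_value))

df_res = pd.concat(pieces).sort_values(['p', 'e']).reset_index(drop=True)
```

A df1 row that matches several link entries appears once per match, which is why p=3 shows up twice in the expected output.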
I have two dataframes, df1 and df2 (tables not shown).
I need to concatenate columns from df1 and df2.
For each row in df2, the df1 columns should be concatenated to it (final result table not shown).
Assuming that these are represented as Pandas data frames, the following should return the concatenated data frame you're looking for:
result = pd.concat([df1, df2], axis=1, join="inner")
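To see what that call does, here is a small sketch with made-up frames (the original tables are not available). With axis=1 the frames are placed side by side, aligned on their index, and join="inner" keeps only the index labels present in both.

```python
import pandas as pd

# Hypothetical frames; only index labels 1 and 2 appear in both.
df1 = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({"b": ["x", "y", "z"]}, index=[1, 2, 3])

# Side-by-side concatenation, keeping only the shared index labels.
result = pd.concat([df1, df2], axis=1, join="inner")
print(result)
```

If the two frames have unrelated indexes, calling reset_index(drop=True) on both beforehand makes the alignment purely positional.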
I have two dataframes with a common column called 'upc' as such:
df1:
upc
23456793749
78907809834
35894796324
67382808404
93743008374
df2:
upc
4567937
9078098
8947963
3828084
7430083
Notice that the df2 'upc' values are the innermost 7 digits of the df1 'upc' values (the first two and last two digits are dropped).
Note that both df1 and df2 have other columns not shown above.
I want to do an inner merge on 'upc', but matching only on those innermost 7 digits. How can I achieve this?
1) Create both dataframes and convert to string type.
2) pd.merge the two frames, but using the left_on keyword to access the inner 7 characters of your 'upc' series
df1 = pd.DataFrame(data=[
23456793749,
78907809834,
35894796324,
67382808404,
93743008374,], columns = ['upc1'])
df1 = df1.astype(str)
df2 = pd.DataFrame(data=[
4567937,
9078098,
8947963,
3828084,
7430083,], columns = ['upc2'])
df2 = df2.astype(str)
pd.merge(df1, df2, left_on=df1['upc1'].astype(str).str[2:-2], right_on='upc2', how='inner')
Out[5]:
upc1 upc2
0 23456793749 4567937
1 78907809834 9078098
2 35894796324 8947963
3 67382808404 3828084
4 93743008374 7430083
Using str.extract, match each item in df1 against the values in df2, then use the extracted value as the key to merge with df2:
df1['keyfordf2']=df1.astype(str).upc.str.extract(r'({})'.format('|'.join(df2.upc.astype(str).tolist())),expand=True).fillna(False)
df1.merge(df2.astype(str),left_on='keyfordf2',right_on='upc')
Out[273]:
upc_x keyfordf2 upc_y
0 23456793749 4567937 4567937
1 78907809834 9078098 9078098
2 35894796324 8947963 8947963
3 67382808404 3828084 3828084
4 93743008374 7430083 7430083
You could make a new column in df1 and merge on that.
import pandas as pd
df1= pd.DataFrame({'upc': [ 23456793749, 78907809834, 35894796324, 67382808404, 93743008374]})
df2= pd.DataFrame({'upc': [ 4567937, 9078098, 8947963, 3828084, 7430083]})
df1['upc_old'] = df1['upc'] #in case you still need the old (longer) upc column
df1['upc'] = df1['upc'].astype(str).str[2:-2].astype(int)
merged_df = pd.merge(df1, df2, on='upc')
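A variation on the same idea stays with integers instead of going through strings. This is a sketch that assumes every df1 'upc' has exactly 11 digits, as in the sample data; the `upc_full` column name is just an illustrative choice.

```python
import pandas as pd

df1 = pd.DataFrame({'upc': [23456793749, 78907809834, 35894796324, 67382808404, 93743008374]})
df2 = pd.DataFrame({'upc': [4567937, 9078098, 8947963, 3828084, 7430083]})

# Keep the full code under a different name (hypothetical column name).
left = df1.rename(columns={'upc': 'upc_full'})

# Integer-divide by 100 to drop the last two digits, then take the remainder
# modulo 10**7 to keep the seven lowest remaining digits.
left['upc'] = (left['upc_full'] // 100) % 10**7

merged_df = left.merge(df2, on='upc')
```

As with the astype(int) version, a leading zero in the inner 7 digits is not preserved, so this only fits if df2 stores its codes as integers too.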
I have two dataframes (let's call them df1 and df2). I want to perform an inner join based on the index, but only take the columns from df1.
In SQL, it would be:
Select a.*
From df1 a
Inner join df2 b
On a.index = b.index
My code in Python is:
pd.concat([df1, df2], axis = 1, join = 'inner', join_axes = [df1.index])
But it selects all columns from both df1 and df2.
One way to do this is to select df1's columns with [] after your pd.concat:
pd.concat([df1, df2], axis = 1, join = 'inner', join_axes = [df1.index])[df1.columns]
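Note that the join_axes argument was removed in pandas 1.0, so the calls above only run on older versions. Since only df1's columns are wanted, one alternative is to filter df1 by index membership instead of concatenating. This is a sketch with made-up frames, and it assumes the index values in df2 are unique (with duplicates, a SQL inner join would repeat rows, while this filter would not).

```python
import pandas as pd

# Hypothetical frames; only the label "r2" is shared between the two indexes.
df1 = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]}, index=["r1", "r2", "r3"])
df2 = pd.DataFrame({"z": [10, 20]}, index=["r2", "r4"])

# Keep the df1 rows whose index label also exists in df2; columns stay df1's.
result = df1.loc[df1.index.isin(df2.index)]
print(result)
```

On current pandas, pd.concat([df1, df2], axis=1, join='inner')[df1.columns] should give the same rows when the two frames share no column names.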