How to merge two dataframes with some row values equal? - python

I have two dataframes which I want to merge into one. The first one has a column named ID, while the second has the same values in a column named id_number. I tried the code below, but the resulting final_df contains both the ID and the id_number columns and their values. How can I keep only one column for the ids after merging?
final_df = df.merge(
    df2,
    left_on='ID',
    right_on='id_number',
    how='inner')
Also, suppose column A of df looks like this:
A
0
1
2
The same column A in the second dataframe has some empty fields, like this:
A
-
1
2
After the merge, how can the final dataframe combine the two dataframes so that A has no empty values?

Try selecting the required columns after the merge:
final_df = df.merge(
    df2,
    left_on='ID',
    right_on='id_number',
    how='inner')[['ID', 'col1', 'col2']]
Or drop the column after the merge:
final_df = df.merge(
    df2,
    left_on='ID',
    right_on='id_number',
    how='inner').drop(['id_number'], axis=1)
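A minimal runnable sketch of the drop variant, in case it helps to see it end to end. The sample values and the col1/col2 columns are made up for illustration; only ID and id_number come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the ID / id_number names come from the question.
df = pd.DataFrame({'ID': [1, 2, 3], 'col1': ['a', 'b', 'c']})
df2 = pd.DataFrame({'id_number': [2, 3, 4], 'col2': ['x', 'y', 'z']})

# Inner merge keeps only the ids present in both frames,
# then the redundant right-hand key column is dropped.
final_df = (
    df.merge(df2, left_on='ID', right_on='id_number', how='inner')
      .drop(columns='id_number')
)
print(final_df)
```

With how='inner' the two key columns always hold identical values, so dropping either one loses nothing. With an outer join that would no longer be true.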

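You can use combine_first, which fills missing values in one dataframe with values from the other (rows are aligned on the index, not on ID):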
df.combine_first(df2.rename(columns={'id_number': 'ID'}))
A full working example:
import pandas as pd
dfa = pd.DataFrame({'ID': [1, 2, 3], 'other': ['A', 'B', 'C']})
dfb = pd.DataFrame({'id_number': [None, 2, 3], 'other_2': ['A2', 'B2', 'C2']})
dfa.combine_first(dfb.rename(columns={'id_number': 'ID'}))
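Since combine_first aligns rows by index label, the rows of the two frames only line up by id if the id is the index. Here is a rough sketch of that, assuming the ids are unique in each frame; the sample values are invented.

```python
import pandas as pd

# Hypothetical sample data: dfb is in a different row order and has a gap in A.
dfa = pd.DataFrame({'ID': [1, 2, 3], 'A': [0.0, 1.0, 2.0]})
dfb = pd.DataFrame({'id_number': [3, 1, 2], 'A': [2.0, None, 1.0]})

# Put the id on the index so combine_first matches rows by id.
left = dfa.set_index('ID')
right = dfb.rename(columns={'id_number': 'ID'}).set_index('ID')

# Values from `right` win; its missing entries are filled from `left`.
combined = (
    right.combine_first(left)
         .reset_index()
         .sort_values('ID', ignore_index=True)
)
print(combined)
```

Swap left and right if the first dataframe's values should take priority instead.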

Rename the id_number column of df2 to ID on the fly, then merge on the shared name:
final_df = df.merge(
    df2.rename(columns={'id_number': 'ID'}),
    on='ID',
    how='inner')
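For the second part of the question (no empty values in A), one possible approach is to let the merge keep both A columns under different suffixes and then fill one from the other. This is only a sketch: the '_df2' suffix and the sample values are my own choices, not something from the question.

```python
import pandas as pd

# Hypothetical sample data with gaps in A on both sides.
df = pd.DataFrame({'ID': [10, 11, 12], 'A': [0.0, None, 2.0]})
df2 = pd.DataFrame({'id_number': [10, 11, 12], 'A': [None, 1.0, 2.0]})

merged = df.merge(
    df2.rename(columns={'id_number': 'ID'}),
    on='ID',
    how='inner',
    suffixes=('', '_df2'),  # df keeps 'A', df2's copy becomes 'A_df2'
)

# Prefer df's value and fall back to df2's where df's is missing.
merged['A'] = merged['A'].fillna(merged['A_df2'])
merged = merged.drop(columns='A_df2')
print(merged)
```

If both frames are missing A for the same id, the value stays missing.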

Related

python - drop duplicated index in place in a pandas dataframe

I have a list of dataframes:
all_df = [df1, df2, df3]
I would like to remove rows with duplicated indices in all dataframes in the list, such that the changes are reflected in the original dataframes df1, df2 and df3.
I tried to do
for df in all_df:
    df = df[~df.index.duplicated()]
But the changes are only applied in the list, not on the original dataframes.
Essentially, I want to avoid doing the following:
df1 = df1[~df1.index.duplicated()]
df2 = df2[~df2.index.duplicated()]
df3 = df3[~df3.index.duplicated()]
all_df = [df1,df2,df3]
You need to recreate the list of DataFrames:
all_df = [df[~df.index.duplicated()] for df in all_df]
Or:
for i, df in enumerate(all_df):
    all_df[i] = df[~df.index.duplicated()]
print(all_df[0])
EDIT: If the names matter, use a dictionary of DataFrames. This still does not modify df1 and df2 in place; you need to select the results by the keys of the dictionary:
d = {'price': df1, 'volumes': df2}
d = {k: df[~df.index.duplicated()] for k, df in d.items()}
print(d['price'])
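To make the point about names concrete, here is a small sketch with made-up data. Boolean indexing returns new objects, so the names df1 and df2 keep pointing at the old frames until you rebind them explicitly.

```python
import pandas as pd

# Hypothetical frames with duplicated index labels.
df1 = pd.DataFrame({'price': [1, 2, 3]}, index=['a', 'a', 'b'])
df2 = pd.DataFrame({'volume': [10, 20, 30]}, index=['x', 'y', 'y'])

d = {'price': df1, 'volumes': df2}
# Keep the first row for each index label (the default of Index.duplicated).
d = {k: v[~v.index.duplicated()] for k, v in d.items()}

# df1 and df2 still refer to the original frames here;
# rebind the names if the deduplicated versions are needed under them.
df1, df2 = d['price'], d['volumes']
print(df1)
```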

Merge a list of pandas dataframes WITH different column names each time

I have a list of dataframes and a corresponding list of column names to join on, for example:
dfs = [df1, df2, df3, df4]
col_join = ["col1", "col2", "col3"]
I have seen answers using the reduce function in Python.
import pandas as pd
from functools import reduce
reduce(lambda x, y: pd.merge(x, y, on=["Col1"], how="outer"), dfs)
What I want to achieve is the following:
df1 Columns:
Data1 Dim1 Dim2 Dim3
df2 Columns:
Example1 Example2 Example3
df3 Columns:
Other1 Other2 Other3
df1 to df2 is joined by Dim1 to Example1.
df1 to df3 is joined by Dim2 to Other1.
This list goes until df(n) where n can be even 20 dataframes being joined to df1 on different column names.
My idea is to pass a function with the list of the original df1 and the remainder df2, df3, df4 ... dfn.
Other argument is the list of merging columns, like above for example it would be: left_on=["Dim1"], right_on=["Example1"].
Next would be, joining df1 (already including df2 in the join) to df3 on Dim2 and Other1.
Each dataframe will be joined to df1 on a different column, which may or may not share its name with the column in df1; that is why left_on and right_on should be used.
How can I account for the merging columns changing at each join inside the reduce function?
Thank you in advance.
This might work (untested):
result = df1
for df, col in zip(dfs[1:], col_join):
    result = pd.merge(result, df, on=[col], how='outer')
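If the key has a different name on each side, the same loop could carry a (left, right) pair of column names per join instead of a single name. A sketch under that assumption; the join_cols list of pairs and the sample values are hypothetical, the column names are taken from the question.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
df1 = pd.DataFrame({'Data1': [1, 2], 'Dim1': ['a', 'b'], 'Dim2': ['p', 'q']})
df2 = pd.DataFrame({'Example1': ['a', 'b'], 'Example2': [10, 20]})
df3 = pd.DataFrame({'Other1': ['p', 'q'], 'Other2': [100, 200]})

others = [df2, df3]
# One (column in df1, column in the other frame) pair per join.
join_cols = [('Dim1', 'Example1'), ('Dim2', 'Other1')]

result = df1
for right, (left_col, right_col) in zip(others, join_cols):
    result = result.merge(right, left_on=left_col, right_on=right_col, how='outer')
print(result)
```

The same pairs could be fed to functools.reduce with df1 as the initial value, but a plain loop is arguably easier to read and debug.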

pandas - merge df with multiindex on column with other df on index (without multiindex)

My Google-fu didn't bring me the answer so I am posting my question here.
Let's say I have two data frames df1 and df2 and I want to merge them. df1 has a multi-index on columns and df2 consists of one multi-index column with an index. The index of df2 has a name that coincides with a name (at level 1) of one column in df1. How to merge the frames using one column in df1 and the index of df2? Simple example would go like this:
import pandas as pd
df1 = pd.DataFrame({('A', 'Col_1'): [1, 2, 3],
('A', 'Col_2'): ['A', 'B', 'C'],
('B', 'Col_1'): [1, 2, 3],
('B', 'Col_2'): ['A', 'B', 'C']})
df2 = pd.DataFrame({('C', 'Col_1'): ['X', 'Y', 'Z']},
index=pd.Index(['A', 'B', 'C'], name='Col_2'))
My aim is to merge df1 on column ('B', 'Col_2') with df2 on index, preserving all the columns in df1. How to do that?
As I understand the question, you want df1 and df2 to be joined on Col_2. Here is how you can do it. If I have missed something, please say so in the comments.
# Drop the top (group) level of the column index in df1
df1.columns = df1.columns.droplevel(0)
# Remove the duplicated column names in df1
df1 = df1.loc[:, ~df1.columns.duplicated()]
# Drop the top (group) level of the column index in df2
df2.columns = df2.columns.droplevel(0)
# Move the index of df2 into a regular column
df2.reset_index(level=0, inplace=True)
# Concatenate the two dataframes on Col_2
new_df = pd.concat([df1.set_index('Col_2'), df2.set_index('Col_2')], axis=1,
                   join='inner').reset_index()
The final output will look like this
  Col_2  Col_1 Col_1
0     A      1     X
1     B      2     Y
2     C      3     Z
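If all the columns of df1 need to survive with their two-level names, one alternative is to treat the join as a lookup: map the values of ('B', 'Col_2') through each column of df2. This is only a sketch, and it assumes the index of df2 is unique (it behaves like a left join; unmatched keys become NaN).

```python
import pandas as pd

df1 = pd.DataFrame({('A', 'Col_1'): [1, 2, 3],
                    ('A', 'Col_2'): ['A', 'B', 'C'],
                    ('B', 'Col_1'): [1, 2, 3],
                    ('B', 'Col_2'): ['A', 'B', 'C']})
df2 = pd.DataFrame({('C', 'Col_1'): ['X', 'Y', 'Z']},
                   index=pd.Index(['A', 'B', 'C'], name='Col_2'))

result = df1.copy()
# For each column of df2, look up the value whose index label
# matches the key stored in ('B', 'Col_2').
for col in df2.columns:
    result[col] = result[('B', 'Col_2')].map(df2[col])
print(result)
```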

Pandas merge on part of two columns

I have two dataframes with a common column called 'upc' as such:
df1:
upc
23456793749
78907809834
35894796324
67382808404
93743008374
df2:
upc
4567937
9078098
8947963
3828084
7430083
Notice that df2 'upc' values are the innermost 7 values of df1 'upc' values.
Note that both df1 and df2 have other columns not shown above.
What I want to do is do an inner merge on 'upc' but only on the innermost 7 values. How can I achieve this?
1) Create both dataframes and convert to string type.
2) pd.merge the two frames, but using the left_on keyword to access the inner 7 characters of your 'upc' series
df1 = pd.DataFrame(data=[
    23456793749,
    78907809834,
    35894796324,
    67382808404,
    93743008374,
], columns=['upc1'])
df1 = df1.astype(str)
df2 = pd.DataFrame(data=[
    4567937,
    9078098,
    8947963,
    3828084,
    7430083,
], columns=['upc2'])
df2 = df2.astype(str)
pd.merge(df1, df2, left_on=df1['upc1'].astype(str).str[2:-2], right_on='upc2', how='inner')
Out[5]:
          upc1     upc2
0  23456793749  4567937
1  78907809834  9078098
2  35894796324  8947963
3  67382808404  3828084
4  93743008374  7430083
Use str.extract to match every value in df1 against the values in df2, then use the extracted result as the key for merging with df2:
df1['keyfordf2']=df1.astype(str).upc.str.extract(r'({})'.format('|'.join(df2.upc.astype(str).tolist())),expand=True).fillna(False)
df1.merge(df2.astype(str),left_on='keyfordf2',right_on='upc')
Out[273]:
         upc_x keyfordf2    upc_y
0  23456793749   4567937  4567937
1  78907809834   9078098  9078098
2  35894796324   8947963  8947963
3  67382808404   3828084  3828084
4  93743008374   7430083  7430083
You could make a new column in df1 and merge on that.
import pandas as pd
df1= pd.DataFrame({'upc': [ 23456793749, 78907809834, 35894796324, 67382808404, 93743008374]})
df2= pd.DataFrame({'upc': [ 4567937, 9078098, 8947963, 3828084, 7430083]})
df1['upc_old'] = df1['upc'] #in case you still need the old (longer) upc column
df1['upc'] = df1['upc'].astype(str).str[2:-2].astype(int)
merged_df = pd.merge(df1, df2, on='upc')
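A variation on the same idea that keeps the key as a string in a separate helper column, so the original upc columns stay untouched on both sides. The upc_key column name and the suffixes are my own choices, and the slice assumes every df1 value has exactly 11 digits.

```python
import pandas as pd

df1 = pd.DataFrame({'upc': [23456793749, 78907809834, 35894796324]})
df2 = pd.DataFrame({'upc': [4567937, 9078098, 8947963]})

# Helper key: the inner 7 characters of the 11-digit code, kept as text.
df1['upc_key'] = df1['upc'].astype(str).str[2:-2]
df2['upc_key'] = df2['upc'].astype(str)

merged = df1.merge(df2, on='upc_key', how='inner', suffixes=('_long', '_short'))
print(merged)
```

Comparing strings avoids converting the sliced key back to an integer, which would silently drop a leading zero in the inner part of the code.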

python merge two dataframe with common columns

I am trying to merge two dataframes which have 3 columns with the same names.
df1 = ['id','first','last','present']
df2 = ['id','first','last','age','date','location']
I want to join on these three common columns. I tried the following:
cols_to_use = df2.columns.difference(df1.columns)
result = pd.merge(df1, df2[cols_to_use], left_index=True, right_index=True, how='outer')
to get
['id','first','last','present','age','date','location']
Can this be done in a single pd.merge step?
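One way this might be done in a single step is to pass the three shared column names to the on parameter of pd.merge. A minimal sketch with invented sample values; only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
df1 = pd.DataFrame({'id': [1, 2], 'first': ['Ann', 'Bob'],
                    'last': ['Lee', 'Ray'], 'present': [True, False]})
df2 = pd.DataFrame({'id': [1, 2], 'first': ['Ann', 'Bob'],
                    'last': ['Lee', 'Ray'], 'age': [30, 41],
                    'date': ['2020-01-01', '2020-01-02'],
                    'location': ['X', 'Y']})

# Join on all three shared columns; each appears once in the result.
result = pd.merge(df1, df2, on=['id', 'first', 'last'], how='outer')
print(result.columns.tolist())
```

When on is omitted, pd.merge joins on the columns the two frames have in common, so here pd.merge(df1, df2, how='outer') should give the same result. Listing the keys explicitly is safer if more shared columns are added later.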
