Merging multiple dataframes on different columns - python

Using Pandas 1.2.1
MRE:
df_a = pd.DataFrame({"A":[1,2,3,4], "B":[33, 44, 55, 66]})
df_b = pd.DataFrame({"B":[33, 44,99], "C":["v", "z", "z"]})
df_c = pd.DataFrame({"A":[3,4,77,55], "D":["aa", "bb", "cc", "dd"]})
Using the three DataFrames created above, I want to join all of them together. However:
df_a and df_b share column "B", so they should join on "B".
df_a and df_c share column "A", so they should join on "A".
I want to left-join df_b and df_c onto df_a. Currently this is my method:
merged_df = pd.merge(df_a, df_b, on=["B"], how="left")
merged_df = pd.merge(merged_df, df_c, on=["A"], how="left")
I know this works fine, but I can't help thinking there is an easier and faster way. There are multiple questions about joining several DataFrames on the same column using reduce, but I could not find a solution for joining on different columns.

You can omit the on parameter, so that each merge uses the intersection of the column names of the two DataFrames:
merged_df = pd.merge(df_a, df_b, how="left")
merged_df = pd.merge(merged_df, df_c, how="left")
A more dynamic approach is to use reduce, again without the on parameter:
from functools import reduce
dfList = [df_a, df_b, df_c]
df = reduce(lambda left, right: pd.merge(left, right, how="left"), dfList)
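One caveat with dropping on: pandas then joins on every column name the two frames share, which may be more than intended once the intermediate result has picked up extra columns. A minimal sketch that keeps the keys explicit while still using reduce is below. The pairing list `steps` is a name introduced here for illustration and is not part of the original answer.

```python
from functools import reduce

import pandas as pd

df_a = pd.DataFrame({"A": [1, 2, 3, 4], "B": [33, 44, 55, 66]})
df_b = pd.DataFrame({"B": [33, 44, 99], "C": ["v", "z", "z"]})
df_c = pd.DataFrame({"A": [3, 4, 77, 55], "D": ["aa", "bb", "cc", "dd"]})

# Hypothetical helper structure: each right-hand frame paired with its join key.
steps = [(df_b, "B"), (df_c, "A")]

# Start from df_a and left-join each frame on its own key.
merged_df = reduce(
    lambda left, step: left.merge(step[0], on=step[1], how="left"),
    steps,
    df_a,
)
print(merged_df)
```

With the sample frames from the question, this should give the same four rows as the two-step merge.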

Related

Mapping Two dataframes Pandas

I want to map two dataframes in pandas. In DF1 I have
df1
my second dataframe looks like
df2
I want to merge the two dataframes and get something like this
merged DF
Wherever a 1 occurs in DF1, it should be replaced by the corresponding value after merging.
So far I have tried:
mergedDF = pd.merge(df1,df2, on=companies)
It seems you need the .idxmax() method.
merged = df1.merge(df2, on='Company')
merged['values'] = merged[[x for x in merged.columns if x != 'Company']].idxmax(axis=1)
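The tables from the question are not available here, so the following is only a sketch on made-up data: the frames, the company names, and the indicator columns `a`, `b`, `c` are all assumptions. It illustrates what idxmax(axis=1) does in the answer above: for each row it returns the label of the column holding the largest value, which for 0/1 indicator columns is the column containing the 1.

```python
import pandas as pd

# Hypothetical stand-ins for the question's df1 and df2.
df1 = pd.DataFrame({"Company": ["x", "y"], "a": [1, 0], "b": [0, 1]})
df2 = pd.DataFrame({"Company": ["x", "y"], "c": [0, 0]})

merged = df1.merge(df2, on="Company")

# Every column except the key is treated as an indicator column.
value_cols = merged.columns.drop("Company")

# For each row, pick the label of the column with the maximum value.
merged["values"] = merged[value_cols].idxmax(axis=1)
print(merged)
```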

Concatenate multiple dataframe and columns names

I have a list of data-frames
liste = [df1, df2, df3, df4]
They share the same index, called "date". I concatenate them as follows:
pd.concat((dd for dd in liste), axis=1, join='inner')
But the columns all have the same name. I can override the column names manually, but I wonder whether there is a way for the columns to take the names of the corresponding dataframes, in this case "df1", "df2", and so on.
You can replace them as follows:
import pandas as pd
from functools import reduce
liste = [df1, df2, df3, df4]
df_final = reduce(lambda left,right: pd.merge(left,right,on='name'), liste)
Or:
... code snippet ...
df1.merge(df2,on='col_name').merge(df3,on='col_name').merge(df4,on='col_name')
Update based on comment:
To grab the column names of each dataframe automatically, you could integrate the code below (assuming each dataframe has a single column) to your liking:
colnames = {}
for i in range(len(liste)):
    # columns of the i-th dataframe, stored under a 1-based key
    name = liste[i].columns
    colnames[i+1] = name
... merge with code above ...
You could use merge:
df = liste[0]
for data_frame in liste[1:]:
    df = df.merge(data_frame, left_index=True, right_index=True)
By default, pandas appends the suffixes _x and _y to overlapping column names, so repeated merges produce names that are hard to tell apart. You can control this with suffixes=, so perhaps use the position from an enumerate in the loop?
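As an alternative to tuning suffixes= in a merge loop, pd.concat can label each frame's columns itself when it is given a dict, since the dict keys become the outer level of the resulting column index. The sketch below assumes each frame has a single column called `value` and a shared index named `date`; the sample data is made up for illustration.

```python
import pandas as pd

# Hypothetical frames sharing a "date" index and the same column name.
df1 = pd.DataFrame(
    {"value": [1, 2, 3]},
    index=pd.Index(["d1", "d2", "d3"], name="date"),
)
df2 = pd.DataFrame(
    {"value": [10, 20, 30]},
    index=pd.Index(["d2", "d3", "d4"], name="date"),
)

# The dict keys become the outer level of a column MultiIndex.
frames = {"df1": df1, "df2": df2}
combined = pd.concat(frames, axis=1, join="inner")

# With one column per frame, keep only the outer level as the column name.
combined.columns = combined.columns.get_level_values(0)
print(combined)
```

Python variables do not carry their own names, so the names have to be supplied somewhere. A dict keeps each name next to its frame.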

Merging DFs from two different lists in python

There are two lists whose elements are DFs with a DatetimeIndex:
lst_1 = [ df1, df2, df3, df4] #columns are same here 'price'
lst_2 = [df1, df2, df3, df4] #columns are same here 'quantity'
I am merging them one by one using the pandas merge function. I tried adding the two lists together and writing a function like this:
def df_merge(df_price, df_quantity):
    p_q_df1 = pd.merge(df_price, df_quantity, on='Dates')
    return p_q_df1
# this merged df now has price and quantity, representing df1 from lst_1 and lst_2
Still, I have to apply it to every pair again. Is there a better way, maybe a loop, to automate this?
Consider elementwise looping with zip which can be handled in a list comprehension.
# DATES AS INDEX
final_lst = [pd.concat([i, j], axis=1) for i, j in zip(lst_1, lst_2)]
# DATES AS COLUMN
final_lst = [pd.merge(i, j, on='Dates') for i, j in zip(lst_1, lst_2)]
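A small runnable sketch of the zip pairing, on made-up data, may help. The sample prices and quantities are assumptions, and the index is assumed to be named `Dates` so that reset_index yields a `Dates` column for the merge variant.

```python
import pandas as pd

dates = pd.date_range("2021-01-01", periods=2, name="Dates")

# Hypothetical inputs: two price frames and two quantity frames.
lst_1 = [
    pd.DataFrame({"price": [1.0, 2.0]}, index=dates),
    pd.DataFrame({"price": [3.0, 4.0]}, index=dates),
]
lst_2 = [
    pd.DataFrame({"quantity": [10, 20]}, index=dates),
    pd.DataFrame({"quantity": [30, 40]}, index=dates),
]

# Dates as index: place each price frame next to its matching quantity frame.
final_lst = [pd.concat([p, q], axis=1) for p, q in zip(lst_1, lst_2)]

# Dates as column: move the index into a column first, then merge on it.
final_lst_col = [
    pd.merge(p.reset_index(), q.reset_index(), on="Dates")
    for p, q in zip(lst_1, lst_2)
]
print(final_lst[0])
```

Note that zip stops at the shorter list, so the two lists should have the same length and ordering.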
IIUC, you could concat your dfs and then merge:
dfs_1 = pd.concat(lst_1)
dfs_2 = pd.concat(lst_2)
pd.merge(dfs_1,dfs_2,on='Dates',how='outer')
# change how to specify the behavior of the merge.
I'm assuming your dataframes are the same shape so they can be concatenated.
If you want to merge multiple dataframes in your list, you can use the reduce function from the Python standard library with an outer merge to get every possible row.
from functools import reduce
lst_1 = [ df1, df2, df3, df4]
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['Dates'],
how='outer'), lst_1)
lst_1 = [df1, df2, df3, df4]  # columns are the same here: 'price'
lst_2 = [df1, df2, df3, df4]  # columns are the same here: 'quantity'
def merge(lst_1, lst_2):
    # start from a real frame: an empty DataFrame has no 'Dates' column to merge on
    df = lst_1[0]
    for _df in lst_1[1:]:
        df = df.merge(_df, on='Dates')
    for _df in lst_2:
        df = df.merge(_df, on='Dates')
    return df

How to call a list of DataFrames in a function? [duplicate]

I have multiple dataframes:
df1, df2, df3,..., dfn
They have the same type of data but from different groups of descriptors that cannot be joined. Now I need to apply the same function to each dataframe manually.
How can I apply the same function to multiple dataframes?
pipe + comprehension
If your dataframes contain related data, as in this case, you should store them in a list (if numeric ordering is sufficient) or dict (if you need to provide custom labels to each dataframe). Then you can pipe each dataframe through a function foo via a comprehension.
List example
df_list = [df1, df2, df3]
df_list = [df.pipe(foo) for df in df_list]
Then access your dataframes via df_list[0], df_list[1], etc.
Dictionary example
df_dict = {'first': df1, 'second': df2, 'third': df3}
df_dict = {k: v.pipe(foo) for k, v in df_dict.items()}
Then access your dataframes via df_dict['first'], df_dict['second'], etc.
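For a concrete picture, here is a minimal sketch of both variants. The function `foo` below is a hypothetical transformation (it adds a row-sum column) standing in for whatever function you actually need to apply, and the two small frames are made up.

```python
import pandas as pd


def foo(df):
    """Hypothetical transformation: add a column with the row-wise sum."""
    return df.assign(total=df.sum(axis=1))


df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [5], "b": [6]})

# List variant: results are addressed by position.
df_list = [df.pipe(foo) for df in [df1, df2]]

# Dict variant: results are addressed by a custom label.
df_dict = {k: v.pipe(foo) for k, v in {"first": df1, "second": df2}.items()}

print(df_list[0])
print(df_dict["second"])
```

Because this `foo` returns a new frame via assign, the original df1 and df2 are left unchanged.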
If the data frames have the same columns you could concat them to a single data frame, but otherwise there is not really a "smart" way of doing it:
df1, df2, df3 = (df.apply(...) for df in [df1, df2, df3]) # or either .map or .applymap

pandas combine_first with particular index columns?

I'm trying to join two dataframes in pandas to have the following behavior: I want to join on a specified column, but have it so redundant columns are not added to the dataframe. This is analogous to combine_first except combine_first does not seem to take an index column optional argument. Example:
# combine df1 and df2 based on "id" column
df1 = pandas.merge(df1, df2, how="outer", on=["id"])
The problem with the above is that columns common to df1 and df2, aside from "id", will be added twice (with _x and _y suffixes) to df1. How can I do something like:
# Do outer join from df2 to df1, matching items by "id" but not adding
# columns that are redundant (df1 takes precedence if the values disagree)
df1.combine_first(df2, on=["id"])
How can this be done?
If you are trying to merge columns from df2 into df1 while excluding any redundant columns, the following should work.
df1.set_index("id", inplace=True)
df2.set_index("id", inplace=True)
# keep only the df2 columns that df1 does not already have
df3 = df1.merge(df2[df2.columns.difference(df1.columns)], left_index=True, right_index=True, how="outer")
However this obviously will not update any values from df1 with values from df2 as it is only bringing in non-redundant columns. But since you said df1 will take precedence on any values that disagree, perhaps this will do the trick?
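If filling from df2 is also wanted, one possible sketch is to move "id" into the index of both frames and then call combine_first, which aligns on the index. This is a different technique from the merge above, and the sample data is made up. Note that combine_first gives df1 precedence only where df1 has a non-null value; nulls in df1 are filled from df2.

```python
import pandas as pd

# Hypothetical frames: "x" is shared, "y" exists only in df1, "z" only in df2.
df1 = pd.DataFrame({"id": [1, 2], "x": [10, 20], "y": ["a", "b"]})
df2 = pd.DataFrame({"id": [2, 3], "x": [99, 30], "z": [1.5, 2.5]})

combined = (
    df1.set_index("id")
    .combine_first(df2.set_index("id"))  # df1 wins where both have a value
    .reset_index()
)
print(combined)
```

Here id 2 keeps df1's value of x, id 3 is taken entirely from df2, and no _x or _y columns appear.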
