Pandas merge two dataframes with one to many relationship - python

I am trying to merge two pandas DataFrames with a one-to-many relationship.
import pandas as pd
df1 = pd.DataFrame({'name': ['AA', 'BB', 'CC'],
                    'col1': [1, 2, 3],
                    'col2': [1, 2, 3]})
df2 = pd.DataFrame({'name': ['AA', 'AA', 'BB'],
                    'col1': [1, 2, 3],
                    'col2': [1, 2, 3]})
df_merged = pd.merge(
    df1,
    df2,
    left_on='name',
    right_on='name',
    how="inner"
)
Two questions:
How do I join the two DataFrames using pd.merge without inserting new rows into df1? The shape of df1 must not change. name is unique in df1.
For rows with a one-to-many relationship, I'd like to join only the first matching row from df2.
When I merge the two DataFrames, it creates new columns (col1_x, col2_x, col1_y, col2_y). I'd like only one copy of those columns.

Use a left join and drop duplicates:
df1.merge(df2, how='left', on='name').drop_duplicates(subset='name',keep='first')
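A minimal sketch of a variation on the one-liner above, assuming it is acceptable to deduplicate df2 before the merge rather than after. The names df2_first and the '_df2' suffix are illustrative choices, not part of the original answer. Deduplicating first means the join cannot add rows to df1, and validate='one_to_one' makes pandas raise an error if duplicate keys remain on either side.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['AA', 'BB', 'CC'],
                    'col1': [1, 2, 3],
                    'col2': [1, 2, 3]})
df2 = pd.DataFrame({'name': ['AA', 'AA', 'BB'],
                    'col1': [1, 2, 3],
                    'col2': [1, 2, 3]})

# Keep only the first df2 row per name so the join cannot add rows to df1.
df2_first = df2.drop_duplicates(subset='name', keep='first')

# Left columns keep their names; overlapping df2 columns get the '_df2' suffix.
# validate='one_to_one' raises MergeError if either side has duplicate keys.
merged = df1.merge(df2_first, on='name', how='left',
                   suffixes=('', '_df2'), validate='one_to_one')
print(merged)
```

If the df2 copies of col1 and col2 are not needed at all, selecting only the wanted columns from df2_first before merging avoids the suffixed columns entirely.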

Related

Pandas: for matching row indices - update dataframe values with values from other dataframe with a different column size

I'm struggling to update values in one DataFrame with values from another DataFrame, using the row index as the key. The DataFrames do not have the same number of columns, so updating can only occur for matching columns. With the code below, I expect df3 to give the same result as df4. However, df3 is None.
Can anyone point me in the right direction? It doesn't seem very complicated, but I can't get it right.
PS: In reality the two DataFrames are a lot larger than the ones in this example (in both rows and columns).
import pandas as pd
data1 = {'A': [1, 2, 3,4],'B': [4, 5, 6,7],'C':[7,8,9,10]}
df1 = pd.DataFrame(data1,index=['I_1','I_2','I_3','I_4'])
print(df1)
data2 = {'A': [10, 40], 'B': [40, 70]}
df2 = pd.DataFrame(data2 ,index=['I_1','I_4'])
print(df2)
df3 = df1.update(df2)
print(df3)
data4 = {'A': [10, 2, 3,40],'B': [40, 5, 6,70],'C':[7,8,9,10]}
df4 = pd.DataFrame(data4 ,index=['I_1','I_2','I_3','I_4'])
print(df4)
pandas.DataFrame.update returns None. The method modifies the calling object in place.
Source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.update.html
For your example this means two things:
update returns None, hence df3 is None.
df1 is modified when df3 = df1.update(df2) is called. In your case df1 looks like df4 from that point on.
To build df3 and leave df1 untouched, update a copy instead:
import pandas as pd
data1 = {'A': [1, 2, 3,4],'B': [4, 5, 6,7],'C':[7,8,9,10]}
df1 = pd.DataFrame(data1,index=['I_1','I_2','I_3','I_4'])
print(df1)
data2 = {'A': [10, 40], 'B': [40, 70]}
df2 = pd.DataFrame(data2 ,index=['I_1','I_4'])
print(df2)
# use a deep copy (the default) so that df1 is not affected by the update method;
# a shallow copy (deep=False) can share its data with df1
df3 = df1.copy()
df3.update(df2)
print(df3)
data4 = {'A': [10, 2, 3,40],'B': [40, 5, 6,70],'C':[7,8,9,10]}
df4 = pd.DataFrame(data4 ,index=['I_1','I_2','I_3','I_4'])
print(df4)
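As a compact check of the points above, here is a sketch that also shows a non-mutating alternative, DataFrame.combine_first. That method is not part of the original answer and behaves slightly differently: it takes the union of rows and columns of both frames, whereas update only touches labels that already exist in the calling frame. For this example the two give the same values.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [4, 5, 6, 7], 'C': [7, 8, 9, 10]},
                   index=['I_1', 'I_2', 'I_3', 'I_4'])
df2 = pd.DataFrame({'A': [10, 40], 'B': [40, 70]}, index=['I_1', 'I_4'])

# update() works in place and returns None, so apply it to a copy of df1.
df3 = df1.copy()
result = df3.update(df2)
print(result)  # None

# Alternative without mutation: use df2's values where present, else df1's.
df3_alt = df2.combine_first(df1)

print(df3)
print(df3_alt)
print(df1)  # unchanged
```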

Find the difference between two columns in a dataframe but keeping the row index available

I have two dataframes:
df1 = pd.DataFrame({"product":['apples', 'bananas', 'oranges', 'kiwi']})
df2 = pd.DataFrame({"product":['apples', 'aples', 'appples', 'banans', 'oranges', 'kiwki'], "key": [1, 2, 3, 4, 5, 6]})
I want to use something like a set(df2).difference(df1) to find the difference between the product columns but I want to keep the indexes. So ideally the output would look like this
result =['aples', 'appples', 'banans', 'kiwki'][2 3 4 6]
Whenever I use the set.difference() I get the list of the different values but I lose the key index.
You have to filter df2, keeping the rows whose product is not in df1:
df2[~df2["product"].isin(df1['product'])]
~ negates the values of a boolean Series.
ser1.isin(ser2) is a boolean Series which gives, for each element of ser1, whether or not the value can be found in ser2.
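A short runnable sketch of this filter on the question's data. The intermediate names mask and only_in_df2 are illustrative. Because the filter selects whole rows of df2, the key column and the original row index come along with the product values.

```python
import pandas as pd

df1 = pd.DataFrame({"product": ['apples', 'bananas', 'oranges', 'kiwi']})
df2 = pd.DataFrame({"product": ['apples', 'aples', 'appples', 'banans', 'oranges', 'kiwki'],
                    "key": [1, 2, 3, 4, 5, 6]})

# True for each df2 row whose product does NOT appear in df1.
mask = ~df2["product"].isin(df1["product"])

# Boolean indexing keeps the matching rows with all their columns.
only_in_df2 = df2[mask]
print(only_in_df2)
```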
I guess you are trying to do a left anti join, which means you only want to keep the rows in df2 that aren't present in df1. In that case:
df1 = pd.DataFrame({"product":['apples', 'bananas', 'oranges', 'kiwi']})
df2 = pd.DataFrame({"product":['apples', 'aples', 'appples', 'banans', 'oranges', 'kiwki'], "key":[1, 2, 3, 4, 5, 6]})
# left join
joined_df = df2.merge(df1, on='product', how='left', indicator=True)
# keeping products that were only present in df2
products_only_in_df2 = joined_df.loc[joined_df['_merge'] == 'left_only', 'product']
# filtering df2 using the above df so we have the keys as well
result = df2[df2['product'].isin(products_only_in_df2)]

Joining a pandas table with multi-index

I have two tables that I want to join - the main table has index SourceID, the sub-table is multi-indexed as it comes from a pivot table - indexes are (SourceID, sourceid)
How can I join a table with a single index to one with multi-index (or change the multi-indexed table to singular)?
The sub-table is created as follows:
d = {'SourceID': [1, 1, 2, 2, 3, 3, 3], 'Year': [0, 1, 0, 1, 1, 2, 3], 'Sales': [100, 200, 300, 400 , 500, 600, 700], 'Profit': [10, 20, 30, 40, 50, 60, 70]}
df = pd.DataFrame(data=d)
df_sub = (
    df
    .pivot_table(
        index=['SourceID'],
        columns=['Year'],
        values=['Sales', 'Profit'],
        fill_value=0,
        aggfunc='mean'
    )
    # .add_prefix('sales_')
    .reset_index()
)
L = [(a, f'{a.lower()}{b}') for a, b in df_sub.columns]
df_sub.columns = pd.MultiIndex.from_tuples(L)
df_sub = df_sub.reset_index()
I'm then trying to join it with the main table df_main
df_all = df_sub.join(df_main.set_index('SourceID'), on='SourceID.sourceid')
but this fails due to the multi-index. The index in the sub-table could be single as long as I don't lose the multi-index on the other fields.
It is possible, but then the MultiIndex values are converted to tuples:
df_all = df_sub.join(df.set_index('SourceID'), on=[('SourceID','sourceid')])
print(df_all)
If you want a MultiIndex in the output, you need to convert the columns of df to a MultiIndex too, e.g. with MultiIndex.from_product:
df1 = df.copy()
df1.columns = pd.MultiIndex.from_product([['orig'], df1.columns])
df_all = df_sub.join(df1.set_index([('orig','SourceID')]), on=[('SourceID','sourceid')])
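The question also mentions the other route, changing the multi-indexed table to a single level. A minimal sketch of that option follows; it gives up the MultiIndex on the pivoted columns, so it only fits if flat names such as sales0 are acceptable. The df_main table and its Region column are hypothetical, since the question does not show the main table.

```python
import pandas as pd

d = {'SourceID': [1, 1, 2, 2, 3, 3, 3],
     'Year': [0, 1, 0, 1, 1, 2, 3],
     'Sales': [100, 200, 300, 400, 500, 600, 700],
     'Profit': [10, 20, 30, 40, 50, 60, 70]}
df = pd.DataFrame(data=d)

# Pivot without reset_index so that SourceID stays as the (single) index.
df_sub = df.pivot_table(index='SourceID', columns='Year',
                        values=['Sales', 'Profit'],
                        fill_value=0, aggfunc='mean')

# Flatten the two column levels, e.g. ('Sales', 0) -> 'sales0'.
df_sub.columns = [f'{name.lower()}{year}' for name, year in df_sub.columns]

# Hypothetical main table; only the SourceID column matters for the join.
df_main = pd.DataFrame({'SourceID': [1, 2, 3], 'Region': ['N', 'S', 'E']})

# Join df_main's SourceID column against df_sub's SourceID index.
df_all = df_main.join(df_sub, on='SourceID')
print(df_all)
```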

Finding out missing transactions in two excels by using Python

I have two Excel/CSV files, loaded as below:
df1 = {'Transaction_Name':['SC-001_Homepage', 'SC-002_Homepage', 'SC-003_Homepage', 'SC-001_Signinlink'], 'Count': [1, 0, 2, 1]}
df1 = pd.DataFrame(df1, columns=df1.keys())
df2 = {'Transaction_Name':['SC-001_Homepage', 'SC-002_Homepage', 'SC-001_Signinlink', 'SC-002_Signinlink'], 'Count': [2, 1, 2, 1]}
df2 = pd.DataFrame(df2, columns=df2.keys())
In df1 there is one extra transaction, SC-003_Homepage, which is not in df2. How can I find only the transactions that are missing from df2?
So far I have done the following to get the transactions:
merged_df = pd.merge(df1, df2, on = 'Transaction_Name', suffixes=('_df1', '_df2'), how='inner')
Maybe a simple set will do the job
set(df1['Transaction_Name']) - set(df2['Transaction_Name'])
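The set difference returns only the names. If the full rows are wanted (for example with their Count), a boolean filter with isin does the same comparison while keeping the DataFrame structure. A small sketch on the question's data, where missing_in_df2 is an illustrative name:

```python
import pandas as pd

df1 = pd.DataFrame({'Transaction_Name': ['SC-001_Homepage', 'SC-002_Homepage',
                                         'SC-003_Homepage', 'SC-001_Signinlink'],
                    'Count': [1, 0, 2, 1]})
df2 = pd.DataFrame({'Transaction_Name': ['SC-001_Homepage', 'SC-002_Homepage',
                                         'SC-001_Signinlink', 'SC-002_Signinlink'],
                    'Count': [2, 1, 2, 1]})

# Names present in df1 but not in df2 (order of a set is not defined).
missing_names = set(df1['Transaction_Name']) - set(df2['Transaction_Name'])

# Same comparison, but keeping the complete df1 rows.
missing_in_df2 = df1[~df1['Transaction_Name'].isin(df2['Transaction_Name'])]

print(missing_names)
print(missing_in_df2)
```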
Add a merge indicator column and then filter the missing data based on it. See the example below.
For more information, see the merge documentation.
import pandas as pd
df1 = {'Transaction_Name':['SC-001_Homepage', 'SC-002_Homepage', 'SC-003_Homepage', 'SC-001_Signinlink'], 'Count': [1, 0, 2, 1]}
df1 = pd.DataFrame(df1, columns=df1.keys())
df2 = {'Transaction_Name':['SC-001_Homepage', 'SC-002_Homepage', 'SC-001_Signinlink', 'SC-002_Signinlink'], 'Count': [2, 1, 2, 1]}
df2 = pd.DataFrame(df2, columns=df2.keys())
#create a merged df
merge_df = df1.merge(df2, on='Transaction_Name', how='outer', suffixes=['', '_'], indicator=True)
#filter rows which are missing in df2
missing_df2_rows = merge_df[merge_df['_merge'] =='left_only'][df1.columns]
#filter rows which are missing in df1
missing_df1_rows = merge_df[merge_df['_merge'] =='right_only'][df2.columns]
print(missing_df2_rows)
print(missing_df1_rows)
Output:
Count Transaction_Name
2 2.0 SC-003_Homepage
Count Transaction_Name
4 NaN SC-002_Signinlink

Selecting columns from dataframe based on the name of other dataframe

I have 3 dataframes,
df
df = pd.DataFrame({'Name': ['CTA15', 'CTA16', 'AC007', 'AC007', 'AC007'],
'AA_ID': [22, 22, 2, 2, 2],
'BB_ID':[4, 5, 6, 8, 9],
'CC_ID' : [2, 2, 3, 3, 3],
'DD_RE': [4,7,8,9,0],
'EE_RE':[5,8,9,9,10]})
and df_ID,
df_ID = pd.DataFrame({'Name': ['CTA15', 'CTA16', 'CFV', 'SAP', 'SOS']})
and the other one is df_RE. Both of these DataFrames have the column Name, so I need to merge each of them with df and then select columns based on the last part of the DataFrame's name. For example, for df_ID I need Name plus all columns ending with "ID" for the rows of df whose Name matches, and for df_RE I need Name plus all columns ending with "RE". I want to save each result separately.
I know I could do this inside a loop:
for dfs in dataframes:
    ID = [col for col in df.columns if '_ID' in col]
    df_ID = pd.merge(df, df_ID, on='Name')
    df_ID = df_ID[ID]
But here ID has to change again when the DataFrame name ends with RE, and so on. I have several files with different suffixes, so a more general solution would be great.
So in the end df_ID should contain all the columns ending with ID:
df_ID = pd.DataFrame({'Name': ['CTA15', 'CTA16'],
                      'AA_ID': [22, 22],
                      'BB_ID': [4, 5],
                      'CC_ID': [2, 2]})
Any help would be great
Assuming your columns in df are Name and anything with a suffix such as the examples you have listed (e.g. _ID, _RE), then what you could do is parse through the column names to first extract all unique possible suffixes:
# since the suffixes follow a pattern of `_*`, then I can look for the `_` character
suffixes = list(set([col[-3:] for col in df.columns if '_' in col]))
Now, with the list of suffixes, you next want to create a dictionary of your existing dataframes, where the keys in the dictionary are suffixes, and the values are the dataframes with the suffix names (e.g. df_ID, df_RE):
dfs = {}
dfs['_ID'] = df_ID
dfs['_RE'] = df_RE
... # and so forth
Now you can loop through your suffixes list to extract the appropriate columns with each suffix in the list and do the merges and column extractions:
for suffix in suffixes:
    cols = [col for col in df.columns if suffix in col]
    dfs[suffix] = pd.merge(df, dfs[suffix], on='Name')
    dfs[suffix] = dfs[suffix][cols]
Now you have your dictionary of suffixed dataframes. If you want your dataframes as separate variables instead of keeping them in your dictionary, you can now set them back as individual objects:
df_ID = dfs['_ID']
df_RE = dfs['_RE']
... # and so forth
Putting it all together in an example
import pandas as pd
df = pd.DataFrame({'Name': ['CTA15', 'CTA16', 'AC007', 'AC007', 'AC007'],
'AA_ID': [22, 22, 2, 2, 2],
'BB_ID': [4, 5, 6, 8, 9],
'CC_ID': [2, 2, 3, 3, 3],
'DD_RE': [4, 7, 8, 9, 0],
'EE_RE': [5, 8, 9, 9, 10]})
# Get unique suffixes
suffixes = list(set([col[-3:] for col in df.columns if '_' in col]))
dfs = {} # dataframes dictionary
df_ID = pd.DataFrame({'Name': ['CTA15', 'CTA16', 'CFV', 'SAP', 'SOS']})
df_RE = pd.DataFrame({'Name': ['AC007']})
dfs['_ID'] = df_ID
dfs['_RE'] = df_RE
for suffix in suffixes:
    cols = [col for col in df.columns if suffix in col]
    dfs[suffix] = pd.merge(df, dfs[suffix], on='Name')
    dfs[suffix] = dfs[suffix][cols]
df_ID = dfs['_ID']
df_RE = dfs['_RE']
print(df_ID)
print(df_RE)
Result:
AA_ID BB_ID CC_ID
0 22 4 2
1 22 5 2
DD_RE EE_RE
0 8 9
1 9 9
2 0 10
You can first merge df with df_ID and then take the columns ending with ID:
pd.merge(df,df_ID,on='Name')[[e for e in df.columns if e.endswith('ID') or e=='Name']]
Out[121]:
AA_ID BB_ID CC_ID Name
0 22 4 2 CTA15
1 22 5 2 CTA16
Similarly, this can be done for the df_RE df as well.
pd.merge(df,df_RE,on='Name')[[e for e in df.columns if e.endswith('RE') or e=='Name']]
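Since the two calls differ only in the suffix and the lookup table, they can be folded into one small helper. This is a sketch, not part of the original answer: the function name select_by_suffix is hypothetical, and the contents of df_RE are an assumption (the question does not show them).

```python
import pandas as pd

df = pd.DataFrame({'Name': ['CTA15', 'CTA16', 'AC007', 'AC007', 'AC007'],
                   'AA_ID': [22, 22, 2, 2, 2],
                   'BB_ID': [4, 5, 6, 8, 9],
                   'CC_ID': [2, 2, 3, 3, 3],
                   'DD_RE': [4, 7, 8, 9, 0],
                   'EE_RE': [5, 8, 9, 9, 10]})
df_ID = pd.DataFrame({'Name': ['CTA15', 'CTA16', 'CFV', 'SAP', 'SOS']})
df_RE = pd.DataFrame({'Name': ['AC007']})  # assumed contents


def select_by_suffix(data, names, suffix):
    """Keep rows of `data` whose Name is in `names`, and only Name plus
    the columns whose label ends with `suffix` (hypothetical helper)."""
    cols = ['Name'] + [c for c in data.columns if c.endswith(suffix)]
    # Drop duplicate names so the inner merge cannot multiply rows.
    keys = names[['Name']].drop_duplicates()
    return data.merge(keys, on='Name')[cols]


out_ID = select_by_suffix(df, df_ID, '_ID')
out_RE = select_by_suffix(df, df_RE, '_RE')
print(out_ID)
print(out_RE)
```

Using endswith rather than a substring test avoids picking up a column that merely contains the suffix somewhere in the middle of its name.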
