Merge Pandas DataFrames on Two Column Values Irrespective of Order in Row - python

Given two dataframes:
df1 = pd.DataFrame([
['Red', 'Blu', 1.1],
['Yel', 'Blu', 2.1],
['Grn', 'Grn', 3.1]], columns=['col_1a','col_1b','score_1'])
df2 = pd.DataFrame([
['Blu', 'Red', 1.2],
['Yel', 'Blu', 2.2],
['Vio', 'Vio', 3.2]], columns=['col_2a','col_2b','score_2'])
I want to merge them on two columns like below:
df3 = pd.DataFrame([
['Blu', 'Red', 1.1, 1.2],
['Yel', 'Blu', 2.1, 2.2],
], columns=['col_a','col_b','score_1','score_2'])
Caveat 1: The order of the column contents can switch between the dataframes to merge. The first row, for example, should be merged because it contains both "Red" and "Blu", even though they appear in different columns.
Caveat 2: The order of the columns in the final df3 is unimportant. Whether "Blu" is in col_a or col_b doesn't mean anything.
Caveat 3: Anything else that doesn't match, like the last row, is ignored.

You can sort the first two columns along the rows, then merge on them:
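# pd.np is deprecated and unavailable in recent pandas versions, so import NumPy directly
import numpy as np
# rename the columns so both frames share the same key names
cols = ['col_a', 'col_b']
df1.columns = cols + ['score_1']
df2.columns = cols + ['score_2']
# sort the two id columns within each row
df1[cols] = np.sort(df1[cols].to_numpy(), axis=1)
df2[cols] = np.sort(df2[cols].to_numpy(), axis=1)
# merge on the shared columns col_a and col_b (inner join by default)
df1.merge(df2)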
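As a self-contained sketch of the same idea that leaves the original frames untouched, the helper below renames the key columns and sorts them within each row before merging. Note that with_sorted_keys is a hypothetical helper name, not a pandas function, and the sketch assumes the key values are mutually comparable (as the strings here are).

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([
    ['Red', 'Blu', 1.1],
    ['Yel', 'Blu', 2.1],
    ['Grn', 'Grn', 3.1]], columns=['col_1a', 'col_1b', 'score_1'])
df2 = pd.DataFrame([
    ['Blu', 'Red', 1.2],
    ['Yel', 'Blu', 2.2],
    ['Vio', 'Vio', 3.2]], columns=['col_2a', 'col_2b', 'score_2'])


def with_sorted_keys(df, key_cols, new_names):
    """Hypothetical helper: return a copy with renamed, row-wise sorted key columns."""
    out = df.rename(columns=dict(zip(key_cols, new_names)))
    # Sort the key values within each row so ('Red', 'Blu') and ('Blu', 'Red') match
    out[new_names] = np.sort(out[new_names].to_numpy(), axis=1)
    return out


keys = ['col_a', 'col_b']
left = with_sorted_keys(df1, ['col_1a', 'col_1b'], keys)
right = with_sorted_keys(df2, ['col_2a', 'col_2b'], keys)

# Inner join keeps only the key pairs present in both frames
df3 = left.merge(right, on=keys)
print(df3)
```

Because the keys are sorted alphabetically within each row, "Blu" always ends up in col_a here, which is fine given that the column order of the pair carries no meaning.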

Related

Find the difference between two columns in a dataframe but keeping the row index available

I have two dataframes:
df1 = pd.DataFrame({"product":['apples', 'bananas', 'oranges', 'kiwi']})
df2 = pd.DataFrame({"product":['apples', 'aples', 'appples', 'banans', 'oranges', 'kiwki'], "key": [1, 2, 3, 4, 5, 6]})
I want to use something like a set(df2).difference(df1) to find the difference between the product columns but I want to keep the indexes. So ideally the output would look like this
result = ['aples', 'appples', 'banans', 'kiwki'] with keys [2, 3, 4, 6]
Whenever I use the set.difference() I get the list of the different values but I lose the key index.
You have to filter the df2 frame checking if the elements from df2 are not in df1:
df2[~df2["product"].isin(df1['product'])]
~ negates the values of a boolean Series.
ser1.isin(ser2) is a boolean Series which gives, for each element of ser1, whether or not the value can be found in ser2.
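To make the two pieces concrete, here is a minimal sketch on two small Series. The names ser1 and ser2 and their contents are made up for illustration.

```python
import pandas as pd

ser1 = pd.Series(['apples', 'aples', 'kiwki'])
ser2 = pd.Series(['apples', 'kiwi'])

# True where the element of ser1 also occurs somewhere in ser2
mask = ser1.isin(ser2)

# ~ flips the mask, so this keeps the elements of ser1 that are NOT in ser2
only_in_ser1 = ser1[~mask]
print(only_in_ser1)
```

Boolean indexing keeps the original index labels of the surviving rows, which is why the key column (or index) is not lost the way it is with set.difference().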
I guess you are trying to do a left anti join, which means you only want to keep the rows in df2 that aren't present in df1. In that case:
df1 = pd.DataFrame({"product":['apples', 'bananas', 'oranges', 'kiwi']})
df2 = pd.DataFrame({"product":['apples', 'aples', 'appples', 'banans', 'oranges', 'kiwki'], "key":[1, 2, 3, 4, 5, 6]})
# left join
joined_df = df2.merge(df1, on='product', how='left', indicator=True)
# keeping products that were only present in df2
products_only_in_df2 = joined_df.loc[joined_df['_merge'] == 'left_only', 'product']
# filtering df2 using the above df so we have the keys as well
result = df2[df2['product'].isin(products_only_in_df2)]

Pivot over multiple tables with pandas

I want to create a pivot with average values over multiple tables. Here is an example that I want to create: Inputs are df1 and df2, res is the result I want to calculate from df1 and df2
import pandas as pd
import numpy as np
df1 = pd.DataFrame({"2000": ["A", "A", "B"],
"2001": ["A", "B", "B"],
"2002": ["B", "B", "B"]},
index =['Item1', 'Item2', 'Item3'])
df2 = pd.DataFrame({"2000": [0.5, 0.7, 0.1],
"2001": [0.6, 0.6, 0.3],
"2002": [0.7, 0.4, 0.2]},
index =['Item1', 'Item2', 'Item3'])
display(df1)
display(df2)
res = pd.DataFrame({"2000": [0.6, 0.1],
"2001": [0.6, 0.45],
"2002": [np.nan, 0.43]},
index =['A', 'B'])
display(res)
Both dataframes have years in columns. Each row is an item. The items change state over time. The state is defined in df1. They also have values each year, defined in df2. I want to calculate the average value by year for each group of states A, B.
I have not managed to calculate res. Any suggestions?
To solve this, first combine both DataFrames into one. For example, you can convert each DataFrame from wide to long format with unstack, concatenate the two along their shared (year, item) index, and then reset the index so that year and item become columns that can be used in the pivot:
df_full = pd.concat([df1.unstack(), df2.unstack()], axis=1).reset_index()
Then, if you want, you can rename columns to build a clear pivot:
df_full = df_full.rename(columns={'level_0': 'year', 'level_1': 'item', 0: 'DF1', 1:'DF2'})
And finally build a pivot table.
res_out = pd.pivot_table(data=df_full, index='DF1', columns='year', values='DF2', aggfunc='mean')
It's not a one line solution, but it works.
df_full = pd.concat([df1.unstack(), df2.unstack()], axis=1).reset_index()
df_full = df_full.rename(columns={'level_0': 'year', 'level_1': 'item', 0: 'DF1', 1:'DF2'})
res_out = pd.pivot_table(data=df_full, index='DF1', columns='year', values='DF2', aggfunc='mean')
display(res_out)
This code using stack, join and unstack should work:
df1_long = df1.stack().to_frame().rename({0:'category'}, axis=1)
df2_long = df2.stack().to_frame().rename({0:'values'}, axis=1)
joined_data = df1_long.join(df2_long).reset_index().rename({'level_0':'item','level_1':'year'}, axis=1)
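# select the numeric column before averaging; mean() on the string column 'item' fails in recent pandas
res = joined_data.groupby(['category', 'year'])['values'].mean().unstack()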
display(res)
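For comparison, here is a sketch of the same wide-to-long idea written with melt and merge instead of stack or unstack. This is an alternative formulation rather than either answer's method, and the column names item, year, state and value are chosen here purely for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"2000": ["A", "A", "B"],
                    "2001": ["A", "B", "B"],
                    "2002": ["B", "B", "B"]},
                   index=['Item1', 'Item2', 'Item3'])
df2 = pd.DataFrame({"2000": [0.5, 0.7, 0.1],
                    "2001": [0.6, 0.6, 0.3],
                    "2002": [0.7, 0.4, 0.2]},
                   index=['Item1', 'Item2', 'Item3'])

# Wide -> long: one row per (item, year) pair in each table
states = (df1.rename_axis('item').reset_index()
             .melt(id_vars='item', var_name='year', value_name='state'))
values = (df2.rename_axis('item').reset_index()
             .melt(id_vars='item', var_name='year', value_name='value'))

# Line up each state with its value, then average per state and year
long = states.merge(values, on=['item', 'year'])
res = long.pivot_table(index='state', columns='year',
                       values='value', aggfunc='mean')
print(res)
```

State A has no items in 2002, so that cell comes out as NaN, matching the expected res in the question.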

Pandas merge two dataframes with one to many relationship

I am trying to merge two pandas DataFrames with a one-to-many relationship.
import pandas as pd
df1 = pd.DataFrame({'name': ['AA', 'BB', 'CC'],
'col1': [1, 2, 3],
'col2': [1, 2, 3] })
df2 = pd.DataFrame({'name': ['AA', 'AA', 'BB'],
'col1': [1, 2, 3],
'col2': [1, 2, 3] })
df_merged = pd.merge(
df1,
df2,
left_on = 'name',
right_on = 'name',
how = "inner"
)
Two questions.
How do I join the two DataFrames using pd.merge without inserting new rows in df1? The shape of df1 must not change. name is unique in df1.
For rows with a one-to-many relationship, I'd like to join the first matching row from df2.
When I merge the two DataFrames, it creates new columns: col1_x, col2_x, col1_y, col2_y. I'd like only one copy of those columns.
Use a left join and then drop the duplicates:
df1.merge(df2, how='left', on='name').drop_duplicates(subset='name',keep='first')
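A variant worth sketching, under the assumption that "first row" means the first occurrence of each name in df2: deduplicate df2 before the merge, so df1 never gains rows in the first place. The suffix '_df2' is an arbitrary choice made here to keep the overlapping column names readable.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['AA', 'BB', 'CC'],
                    'col1': [1, 2, 3],
                    'col2': [1, 2, 3]})
df2 = pd.DataFrame({'name': ['AA', 'AA', 'BB'],
                    'col1': [1, 2, 3],
                    'col2': [1, 2, 3]})

# Keep only the first df2 row per name, so each df1 row matches at most once
first_match = df2.drop_duplicates(subset='name', keep='first')

merged = df1.merge(
    first_match,
    on='name',
    how='left',
    suffixes=('', '_df2'),    # df1 columns keep their names
    validate='one_to_one',    # raises if either side has duplicate keys
)
print(merged)
```

Names without a match in df2 (here 'CC') get NaN in the df2 columns. If the df2 columns are not needed at all, they can simply be dropped from first_match before merging.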

Merge a dataframe only when the column values are identical

I have two data frames df and df_copy. I would like to copy the data from df_copy, but only if the data is also identical. How do I do that?
import pandas as pd
d = {'Nameid': [100, 200, 300, 100]
, 'Name': ['Max', 'Michael', 'Susan', 'Max']
, 'Projectid': [100, 200, 200, 100]}
df = pd.DataFrame(data=d)
display(df.head(5))
df['nameid_index'] = df['Nameid'].astype('category').cat.codes
df['projectid_index'] = df['Projectid'].astype('category').cat.codes
display(df.head(5))
df_copy = df.copy()
df.drop(['Nameid', 'Name', 'Projectid'], axis=1, inplace=True)
df = df.drop([1, 3])
display(df.head(5))
df
df_copy
What I want
I looked at Pandas Merging 101
df.merge(df_copy, on=['nameid_index', 'projectid_index'])
But I got this result
The same row appears twice; I only want it once.
Use DataFrame.drop_duplicates first. In the sample data the repeated key pair is in df_copy (rows 0 and 3), so deduplicate that side before merging:
df1 = df.merge(df_copy.drop_duplicates(['nameid_index', 'projectid_index']),
on=['nameid_index', 'projectid_index'])
If you need to merge on the intersection of the column names in both DataFrames, the on parameter can be omitted:
df1 = df.merge(df_copy.drop_duplicates(['nameid_index', 'projectid_index']))
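The sketch below rebuilds the question's data to show where the repeated row comes from. It assumes the duplicate key pairs are on the df_copy side (rows 0 and 3 both describe Max), so that is the frame deduplicated here. The names keys, naive and deduped are introduced only for this illustration.

```python
import pandas as pd

d = {'Nameid': [100, 200, 300, 100],
     'Name': ['Max', 'Michael', 'Susan', 'Max'],
     'Projectid': [100, 200, 200, 100]}
df_copy = pd.DataFrame(data=d)
df_copy['nameid_index'] = df_copy['Nameid'].astype('category').cat.codes
df_copy['projectid_index'] = df_copy['Projectid'].astype('category').cat.codes

keys = ['nameid_index', 'projectid_index']
# Same shape as df in the question: only the index columns, rows 1 and 3 dropped
df = df_copy[keys].drop([1, 3])

# Plain merge: the Max key pair matches two rows of df_copy
naive = df.merge(df_copy, on=keys)

# Deduplicate the lookup side first, so each key pair matches once
deduped = df.merge(df_copy.drop_duplicates(keys), on=keys)
print(deduped)
```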

How to combine two pd dataframe, re-rank the ranking based on the score and return the entire row for the highest ranking?

I'm trying to combine two pandas DataFrames, re-rank the combined DataFrame, and then extract the row with the highest score (the one ranked first). I tried the code below, but it returns the row at index 1 instead of the row ranked first.
df = pd.DataFrame({'Name' : ['Peter', 'James', 'John', 'Marry'], 'Score' : [6.1, 5.6, 6.8, 7.99],
'ranking' :[ 3, 4, 2, 1]})
df1 = pd.DataFrame({'Name' : ['Albert', 'Kelsey', 'Janice'], 'Score' : [1.1, 8.2, 7],
'ranking' :[ 3, 1, 2]})
pd_combine = pd.concat([df,df1], sort=False)
pd_combine.iloc[pd_combine['Score'].idxmax()]
Just replace the third line with this:
pd_combine = pd.concat([df,df1], sort=False).reset_index()
The problem was that the combined data frame had duplicated index labels. This change ensures that every row has its own unique label in the new combined data frame.
If you want to re-rank the entries in the new, combined data frame, use:
res = pd_combine.sort_values('Score',ascending=False)
res.ranking = range(1,len(res)+1)
Alternatively, instead of sort=False, specify ignore_index=True so the combined data frame gets a fresh 0..n-1 index:
pd_combine = pd.concat([df,df1], ignore_index=True)
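Putting the pieces together, a minimal sketch might look like the following. It uses Series.rank to recompute the ranking (a swapped-in alternative to the sort-and-range approach above, assuming ties should be broken by order of appearance) and loc rather than iloc, since idxmax returns an index label.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Peter', 'James', 'John', 'Marry'],
                   'Score': [6.1, 5.6, 6.8, 7.99],
                   'ranking': [3, 4, 2, 1]})
df1 = pd.DataFrame({'Name': ['Albert', 'Kelsey', 'Janice'],
                    'Score': [1.1, 8.2, 7],
                    'ranking': [3, 1, 2]})

# ignore_index=True gives every row a unique label 0..n-1
combined = pd.concat([df, df1], ignore_index=True)

# Recompute the ranking over the combined scores (1 = highest score)
combined['ranking'] = (combined['Score']
                       .rank(ascending=False, method='first')
                       .astype(int))

# idxmax returns a label, so look the row up with loc
top = combined.loc[combined['Score'].idxmax()]
print(top)
```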
