I have two CSV files, loaded as the DataFrames below:
df1 = {'Transaction_Name':['SC-001_Homepage', 'SC-002_Homepage', 'SC-003_Homepage', 'SC-001_Signinlink'], 'Count': [1, 0, 2, 1]}
df1 = pd.DataFrame(df1, columns=df1.keys())
df2 = {'Transaction_Name':['SC-001_Homepage', 'SC-002_Homepage', 'SC-001_Signinlink', 'SC-002_Signinlink'], 'Count': [2, 1, 2, 1]}
df2 = pd.DataFrame(df2, columns=df2.keys())
In df1 there is one extra transaction, SC-003_Homepage, which is not present in df2. How can I find only the transactions that are missing from df2?
So far I have tried the following merge to get the transactions:
merged_df = pd.merge(df1, df2, on = 'Transaction_Name', suffixes=('_df1', '_df2'), how='inner')
Maybe a simple set difference will do the job:
set(df1['Transaction_Name']) - set(df2['Transaction_Name'])
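If you need the full rows from df1 rather than just the names, a boolean mask with isin gives the same result. A minimal sketch using the df1/df2 from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'Transaction_Name': ['SC-001_Homepage', 'SC-002_Homepage',
                                         'SC-003_Homepage', 'SC-001_Signinlink'],
                    'Count': [1, 0, 2, 1]})
df2 = pd.DataFrame({'Transaction_Name': ['SC-001_Homepage', 'SC-002_Homepage',
                                         'SC-001_Signinlink', 'SC-002_Signinlink'],
                    'Count': [2, 1, 2, 1]})

# names present in df1 but not in df2
missing_names = set(df1['Transaction_Name']) - set(df2['Transaction_Name'])

# full df1 rows for those names, via a boolean mask
missing_rows = df1[~df1['Transaction_Name'].isin(df2['Transaction_Name'])]
print(missing_names)   # {'SC-003_Homepage'}
print(missing_rows)
```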
Add an indicator column to the merge and then filter the missing rows based on it; see the example below.
For more information see the merge documentation.
import pandas as pd
df1 = {'Transaction_Name':['SC-001_Homepage', 'SC-002_Homepage', 'SC-003_Homepage', 'SC-001_Signinlink'], 'Count': [1, 0, 2, 1]}
df1 = pd.DataFrame(df1, columns=df1.keys())
df2 = {'Transaction_Name':['SC-001_Homepage', 'SC-002_Homepage', 'SC-001_Signinlink', 'SC-002_Signinlink'], 'Count': [2, 1, 2, 1]}
df2 = pd.DataFrame(df2, columns=df2.keys())
#create a merged df
merge_df = df1.merge(df2, on='Transaction_Name', how='outer', suffixes=['', '_'], indicator=True)
#filter rows which are missing in df2
missing_df2_rows = merge_df[merge_df['_merge'] =='left_only'][df1.columns]
#filter rows which are missing in df1
missing_df1_rows = merge_df[merge_df['_merge'] =='right_only'][df2.columns]
print(missing_df2_rows)
print(missing_df1_rows)
Output:
Count Transaction_Name
2 2.0 SC-003_Homepage
Count Transaction_Name
4 NaN SC-002_Signinlink
I'm struggling with updating values in a DataFrame with values from another DataFrame, using the row index as the key. The DataFrames do not have the same number of columns, so updating can only occur for matching columns. With the code below I would expect df3 to yield the same result as df4; however, df3 is None.
Can anyone point me in the right direction? It doesn't seem very complicated, but I can't seem to get it right.
ps. In reality the two DataFrames are much larger than the ones in this example (both in rows and columns).
import pandas as pd
data1 = {'A': [1, 2, 3,4],'B': [4, 5, 6,7],'C':[7,8,9,10]}
df1 = pd.DataFrame(data1,index=['I_1','I_2','I_3','I_4'])
print(df1)
data2 = {'A': [10, 40], 'B': [40, 70]}
df2 = pd.DataFrame(data2 ,index=['I_1','I_4'])
print(df2)
df3 = df1.update(df2)
print(df3)
data4 = {'A': [10, 2, 3,40],'B': [40, 5, 6,70],'C':[7,8,9,10]}
df4 = pd.DataFrame(data4 ,index=['I_1','I_2','I_3','I_4'])
print(df4)
pandas.DataFrame.update returns None; the method modifies the calling object in place.
source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.update.html
For your example this means two things:
update returns None, hence df3 is None.
df1 is changed when df3 = df1.update(df2) is called; from that point on, df1 looks like df4.
To build df3 while leaving df1 untouched, copy first:
import pandas as pd
data1 = {'A': [1, 2, 3,4],'B': [4, 5, 6,7],'C':[7,8,9,10]}
df1 = pd.DataFrame(data1,index=['I_1','I_2','I_3','I_4'])
print(df1)
data2 = {'A': [10, 40], 'B': [40, 70]}
df2 = pd.DataFrame(data2 ,index=['I_1','I_4'])
print(df2)
# copy df1 first so it is not affected by update (a deep copy is the
# safe choice here; a shallow copy via deep=False may still share data)
df3 = df1.copy()
df3.update(df2)
print(df3)
data4 = {'A': [10, 2, 3,40],'B': [40, 5, 6,70],'C':[7,8,9,10]}
df4 = pd.DataFrame(data4 ,index=['I_1','I_2','I_3','I_4'])
print(df4)
I am trying to merge two pandas DataFrames with a one-to-many relationship.
import pandas as pd
df1 = pd.DataFrame({'name': ['AA', 'BB', 'CC'],
'col1': [1, 2, 3],
'col2': [1, 2, 3] })
df2 = pd.DataFrame({'name': ['AA', 'AA', 'BB'],
'col1': [1, 2, 3],
'col2': [1, 2, 3] })
df_merged = pd.merge(
df1,
df2,
left_on = 'name',
right_on = 'name',
how = "inner"
)
Two questions:
How do I join the two DataFrames with pd.merge without inserting new rows into df1? The shape of df1 must not change; name is unique in df1.
For rows with a one-to-many relationship, I'd like to join only the first matching row from df2.
When I merge the two DataFrames, it creates new columns col1_x, col2_x, col1_y, col2_y. I'd like only one copy of those columns.
Use a left join, then drop duplicates:
df1.merge(df2, how='left', on='name').drop_duplicates(subset='name',keep='first')
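An alternative sketch: deduplicate df2 first, so df1's shape cannot change, and pass explicit suffixes to label the overlapping columns (using the frames from the question; the '_df2' suffix is just an illustrative choice):

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['AA', 'BB', 'CC'],
                    'col1': [1, 2, 3],
                    'col2': [1, 2, 3]})
df2 = pd.DataFrame({'name': ['AA', 'AA', 'BB'],
                    'col1': [1, 2, 3],
                    'col2': [1, 2, 3]})

# keep only the first df2 row per name, then left-join:
# df1 keeps its 3 rows, and suffixes label the overlapping columns
first_matches = df2.drop_duplicates(subset='name', keep='first')
df_merged = df1.merge(first_matches, on='name', how='left',
                      suffixes=('', '_df2'))
print(df_merged)
```

Rows of df1 with no match in df2 (here 'CC') get NaN in the df2 columns.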
I have two data frames; df1 has Id and sendDate and df2 has Id and actDate. The two df's are not the same shape - df2 is a lookup table. There may be multiple instances of Id.
ex.
df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
"sendDate": ["2019-09-24", "2020-09-11", "2018-01-06", "2018-01-06", "2019-09-24"]})
df2 = pd.DataFrame({"Id": [1, 2, 2],
"actDate": ["2019-09-24", "2019-09-24", "2020-09-11"]})
I want to add a boolean True/False in df1 to find when df1.Id == df2.Id and df1.sendDate == df2.actDate.
Expected output would add a column to df1:
df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
"sendDate": ["2019-09-24", "2020-09-11", "2018-01-06", "2018-01-06", "2019-09-24"],
"Match?": [True, False, False, False, True]})
I'm new to Python, coming from R, so please let me know what other info you may need.
Use isin and boolean indexing
import pandas as pd
df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
"sendDate": ["2019-09-24", "2020-09-11",
"2018-01-06", "2018-01-06",
"2019-09-24"]})
df2 = pd.DataFrame({"Id": [1, 2, 2],
"actDate": ["2019-09-24", "2019-09-24", "2020-09-11"]})
df1['Match'] = (df1['Id'].isin(df2['Id'])) & (df1['sendDate'].isin(df2['actDate']))
print(df1)
Output:
Id sendDate Match
0 1 2019-09-24 True
1 1 2020-09-11 True
2 2 2018-01-06 False
3 3 2018-01-06 False
4 2 2019-09-24 True
The .isin() approaches will find values where the ID and date entries don't necessarily appear together (e.g. Id=1 and date=2020-09-11 in your example). You can check for both by doing a .merge() and checking when df2's date field is not null:
df1['match'] = df1.merge(df2, how='left', left_on=['Id', 'sendDate'], right_on=['Id', 'actDate'])['actDate'].notnull()
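Another way to test the (Id, date) pairs together, without a merge, is to compare MultiIndexes built from the two column pairs. A sketch with the frames from the question:

```python
import pandas as pd

df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
                    "sendDate": ["2019-09-24", "2020-09-11", "2018-01-06",
                                 "2018-01-06", "2019-09-24"]})
df2 = pd.DataFrame({"Id": [1, 2, 2],
                    "actDate": ["2019-09-24", "2019-09-24", "2020-09-11"]})

# build (Id, date) pairs on both sides and test membership pairwise,
# so Id and date must match together, not just column by column
left_pairs = pd.MultiIndex.from_frame(df1[['Id', 'sendDate']])
right_pairs = pd.MultiIndex.from_frame(df2[['Id', 'actDate']])
df1['Match'] = left_pairs.isin(right_pairs)
print(df1)
```

This produces the expected [True, False, False, False, True], unlike the per-column isin checks.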
A vectorized approach via numpy (note this checks the two columns independently, like the other isin answers, and np.where(cond, True, False) is equivalent to cond itself):
import numpy as np
df1['Match'] = np.where((df1['Id'].isin(df2['Id'])) & (df1['sendDate'].isin(df2['actDate'])),True,False)
You can use .isin():
df1['id_bool'] = df1.Id.isin(df2.Id)
df1['date_bool'] = df1.sendDate.isin(df2.actDate)
Check out the documentation here.
I have two tables that I want to join. The main table has the index SourceID; the sub-table is multi-indexed, as it comes from a pivot table, with column tuples like (SourceID, sourceid).
How can I join a table with a single-level index to one with a MultiIndex (or change the multi-indexed table to single-level)?
The sub-table is created as follows:
d = {'SourceID': [1, 1, 2, 2, 3, 3, 3], 'Year': [0, 1, 0, 1, 1, 2, 3], 'Sales': [100, 200, 300, 400 , 500, 600, 700], 'Profit': [10, 20, 30, 40, 50, 60, 70]}
df = pd.DataFrame(data=d)
df_sub = (
df
.pivot_table(
index=['SourceID'],
columns=['Year'],
values=['Sales', 'Profit'],
fill_value=0,
aggfunc='mean'
)
# .add_prefix('sales_')
.reset_index()
)
L = [(a, f'{a.lower()}{b}') for a, b in df_sub.columns]
df_sub.columns = pd.MultiIndex.from_tuples(L)
df_sub = df_sub.reset_index()
I'm then trying to join it with the main table df_main
df_all = df_sub.join(df_main.set_index('SourceID'), on='SourceID.sourceid')
but this fails due to the MultiIndex. The index in the sub-table could be single-level, as long as I don't lose the MultiIndex on the other fields.
It is possible, but the MultiIndex values are then converted to tuples:
df_all = df_sub.join(df.set_index('SourceID'), on=[('SourceID','sourceid')])
print (df_all)
If you want a MultiIndex in the output, you must convert df's columns to a MultiIndex too, e.g. with MultiIndex.from_product:
df1 = df.copy()
df1.columns = pd.MultiIndex.from_product([['orig'], df1.columns])
df_all = df_sub.join(df1.set_index([('orig','SourceID')]), on=[('SourceID','sourceid')])
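Alternatively, if the two-level columns are not needed downstream, you can flatten the pivot table's columns to plain strings and do an ordinary single-level merge. A sketch using the pivot from the question; df_main here is a hypothetical stand-in, since the question does not show the main table's contents:

```python
import pandas as pd

d = {'SourceID': [1, 1, 2, 2, 3, 3, 3],
     'Year': [0, 1, 0, 1, 1, 2, 3],
     'Sales': [100, 200, 300, 400, 500, 600, 700],
     'Profit': [10, 20, 30, 40, 50, 60, 70]}
df = pd.DataFrame(data=d)

df_sub = df.pivot_table(index=['SourceID'], columns=['Year'],
                        values=['Sales', 'Profit'],
                        fill_value=0, aggfunc='mean')

# flatten ('Sales', 0) -> 'Sales_0' so the columns are a plain Index again
df_sub.columns = [f'{a}_{b}' for a, b in df_sub.columns]
df_sub = df_sub.reset_index()

# hypothetical main table with a SourceID column; a plain merge now works
df_main = pd.DataFrame({'SourceID': [1, 2, 3], 'Region': ['N', 'S', 'E']})
df_all = df_main.merge(df_sub, on='SourceID', how='left')
print(df_all)
```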
After renaming a DataFrame's column(s), I get an error when merging on the new column(s):
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'b': [3, 1]})
df1.columns = [['b']]
df1.merge(df2, on='b')
TypeError: only integer scalar arrays can be converted to a scalar index
When renaming columns, use DataFrame.columns = [list], not DataFrame.columns = [[list]]:
df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'b': [3, 1]})
df1.columns = ['b']
df1.merge(df2, on='b')
# b
# 0 1
Replacing tmp.columns = [['POR','POR_PORT']] with tmp.rename(columns={'Locode':'POR', 'Port Name':'POR_PORT'}, inplace=True) worked.
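The reason the nested list breaks merge: assigning a list of lists to .columns is interpreted as column arrays and produces a one-level MultiIndex rather than a plain rename, and merge's on='b' lookup then fails. A small sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})

df1.columns = [['b']]      # nested list -> one-level MultiIndex, not a rename
print(type(df1.columns))   # MultiIndex

df1.columns = ['b']        # flat list -> regular Index; merge on='b' works
print(type(df1.columns))   # Index
```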