Merge a dataframe only when the column values are identical - python

I have two DataFrames, df and df_copy. I would like to bring the data from df_copy back into df, but only where the rows are identical. How do I do that?
import pandas as pd

d = {'Nameid': [100, 200, 300, 100],
     'Name': ['Max', 'Michael', 'Susan', 'Max'],
     'Projectid': [100, 200, 200, 100]}
df = pd.DataFrame(data=d)
display(df.head(5))

# encode the key columns as category codes
df['nameid_index'] = df['Nameid'].astype('category').cat.codes
df['projectid_index'] = df['Projectid'].astype('category').cat.codes
display(df.head(5))

# keep a full copy, then reduce df to the code columns and drop two rows
df_copy = df.copy()
df.drop(['Nameid', 'Name', 'Projectid'], axis=1, inplace=True)
df = df.drop([1, 3])
display(df.head(5))

df
df_copy
What I want is each matching row from df_copy to appear once in the result. I looked at Pandas Merging 101 and tried:
df.merge(df_copy, on=['nameid_index', 'projectid_index'])
But in the result I got, the same row appears twice; I only want it once.

Use DataFrame.drop_duplicates first:
df1 = (df.drop_duplicates(['nameid_index', 'projectid_index'])
         .merge(df_copy, on=['nameid_index', 'projectid_index']))
If you need to merge by the intersection of column names in both DataFrames, the on parameter should be removed:
df1 = df.drop_duplicates(['nameid_index', 'projectid_index']).merge(df_copy)
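If the duplicated result rows come from duplicate key rows in df_copy itself (in the sample data, Nameid 100 / Projectid 100 appears twice, so both copies match), a minimal sketch, not part of the answer above, is to de-duplicate both sides before merging:
keys = ['nameid_index', 'projectid_index']
df1 = (df.drop_duplicates(keys)
         .merge(df_copy.drop_duplicates(keys), on=keys))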

Related

Pandas: for matching row indices - update dataframe values with values from other dataframe with a different column size

I'm struggling to update values in a dataframe with values from another dataframe, using the row index as the key. The dataframes are not identical in terms of number of columns, so updating can only occur for matching columns. With the code below I would expect df3 to yield the same result as df4; however, df3 returns a None object.
Can anyone point me in the right direction? It doesn't seem very complicated, but I can't seem to get it right.
PS: In reality the two dataframes are a lot larger than the ones in this example (both in terms of rows and columns).
import pandas as pd

data1 = {'A': [1, 2, 3, 4], 'B': [4, 5, 6, 7], 'C': [7, 8, 9, 10]}
df1 = pd.DataFrame(data1, index=['I_1', 'I_2', 'I_3', 'I_4'])
print(df1)

data2 = {'A': [10, 40], 'B': [40, 70]}
df2 = pd.DataFrame(data2, index=['I_1', 'I_4'])
print(df2)

df3 = df1.update(df2)
print(df3)

data4 = {'A': [10, 2, 3, 40], 'B': [40, 5, 6, 70], 'C': [7, 8, 9, 10]}
df4 = pd.DataFrame(data4, index=['I_1', 'I_2', 'I_3', 'I_4'])
print(df4)
pandas.DataFrame.update returns None; the method changes the calling object in place.
Source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.update.html
For your example this means two things:
update returns None, hence df3 = None.
df1 gets changed when df3 = df1.update(df2) is called. In your case df1 will look like df4 from that point on.
To build df3 and leave df1 untouched, you can do this:
import pandas as pd

data1 = {'A': [1, 2, 3, 4], 'B': [4, 5, 6, 7], 'C': [7, 8, 9, 10]}
df1 = pd.DataFrame(data1, index=['I_1', 'I_2', 'I_3', 'I_4'])
print(df1)

data2 = {'A': [10, 40], 'B': [40, 70]}
df2 = pd.DataFrame(data2, index=['I_1', 'I_4'])
print(df2)

# work on a copy so df1 is not affected by the in-place update
df3 = df1.copy(deep=False)
df3.update(df2)
print(df3)

data4 = {'A': [10, 2, 3, 40], 'B': [40, 5, 6, 70], 'C': [7, 8, 9, 10]}
df4 = pd.DataFrame(data4, index=['I_1', 'I_2', 'I_3', 'I_4'])
print(df4)
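Whether a shallow copy (deep=False) really shields df1 depends on the pandas version and its copy-on-write behaviour, so as a safer sketch the default deep copy avoids any shared data outright:
df3 = df1.copy()  # deep copy (the default): df3 owns its own data
df3.update(df2)   # in-place update touches only df3; df1 stays as built
print(df3)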

Return list of columns with string match in multiple DataFrames

I have some DataFrames:
import pandas as pd

d = {'colA': [1, 2], 'colB': [3, 4]}
df = pd.DataFrame(data=d)
df2 = pd.DataFrame(data=d)
df3 = pd.DataFrame(data=d)
I want to return a list of columns containing the string 'A', e.g. for one DataFrame:
[column for column in df.columns if 'A' in column]
How can I do this for multiple DataFrames (e.g., df, df2, df3)?
The desired output in this example would be ['colA', 'colA', 'colA'].
Here is a one-liner; note that it joins the matching names into one comma-separated string per DataFrame:
matches = [','.join(i.columns[i.columns.str.contains('A')]) for i in [df, df2, df3]]
You can create a function for that:
def find_char(list_of_df, char):
    result = []
    for df in list_of_df:
        for c in df.columns:
            if char in c:
                result.append(c)
    return result
Usage:
res = find_char([df, df2, df3], 'A')
print(res)
['colA', 'colA', 'colA']
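As an aside, the built-in DataFrame.filter can express the per-frame match, so the whole lookup collapses to one comprehension; a small sketch over the same three frames:
# filter(like='A') keeps the columns whose name contains 'A'
res = [c for frame in [df, df2, df3] for c in frame.filter(like='A').columns]
print(res)  # ['colA', 'colA', 'colA']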

Joining a pandas table with multi-index

I have two tables that I want to join: the main table has the single index SourceID, while the sub-table is multi-indexed, as it comes from a pivot table; its indexes are (SourceID, sourceid).
How can I join a table with a single index to one with a multi-index (or change the multi-indexed table to a single-indexed one)?
The sub-table is created as follows:
import pandas as pd

d = {'SourceID': [1, 1, 2, 2, 3, 3, 3],
     'Year': [0, 1, 0, 1, 1, 2, 3],
     'Sales': [100, 200, 300, 400, 500, 600, 700],
     'Profit': [10, 20, 30, 40, 50, 60, 70]}
df = pd.DataFrame(data=d)
df_sub = (
    df
    .pivot_table(
        index=['SourceID'],
        columns=['Year'],
        values=['Sales', 'Profit'],
        fill_value=0,
        aggfunc='mean'
    )
    # .add_prefix('sales_')
    .reset_index()
)
L = [(a, f'{a.lower()}{b}') for a, b in df_sub.columns]
df_sub.columns = pd.MultiIndex.from_tuples(L)
df_sub = df_sub.reset_index()
I'm then trying to join it with the main table df_main:
df_all = df_sub.join(df_main.set_index('SourceID'), on='SourceID.sourceid')
but this fails due to the multi-index. The index in the sub-table could be single-level, as long as I don't lose the multi-index on the other fields.
It is possible, but then MultiIndex values are converted to tuples:
df_all = df_sub.join(df.set_index('SourceID'), on=[('SourceID','sourceid')])
print(df_all)
If you want a MultiIndex in the output, it is necessary to convert the df columns to a MultiIndex too, e.g. by MultiIndex.from_product:
df1 = df.copy()
df1.columns = pd.MultiIndex.from_product([['orig'], df1.columns])
df_all = df_sub.join(df1.set_index([('orig','SourceID')]), on=[('SourceID','sourceid')])
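If the MultiIndex on df_sub is not actually needed after the join, another option (a sketch, not part of the answer above, assuming the lowered names built in L are unique) is to flatten the columns to their second level and perform an ordinary merge:
flat = df_sub.copy()
flat.columns = flat.columns.get_level_values(1)  # ('SourceID', 'sourceid') -> 'sourceid'
df_all = flat.merge(df, left_on='sourceid', right_on='SourceID')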

Finding out missing transactions in two excels by using Python

I have two Excel CSV files, loaded into the DataFrames below:
import pandas as pd

df1 = {'Transaction_Name': ['SC-001_Homepage', 'SC-002_Homepage', 'SC-003_Homepage', 'SC-001_Signinlink'], 'Count': [1, 0, 2, 1]}
df1 = pd.DataFrame(df1, columns=df1.keys())
df2 = {'Transaction_Name': ['SC-001_Homepage', 'SC-002_Homepage', 'SC-001_Signinlink', 'SC-002_Signinlink'], 'Count': [2, 1, 2, 1]}
df2 = pd.DataFrame(df2, columns=df2.keys())
In df1 I can see that there is one extra transaction, SC-003_Homepage, which is not in df2. Can someone help me find only the transactions that are missing from df2?
So far I have done the work below to match the transactions:
merged_df = pd.merge(df1, df2, on = 'Transaction_Name', suffixes=('_df1', '_df2'), how='inner')
Maybe a simple set difference will do the job:
set(df1['Transaction_Name']) - set(df2['Transaction_Name'])
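For the sample data this prints the single missing transaction:
print(set(df1['Transaction_Name']) - set(df2['Transaction_Name']))
# {'SC-003_Homepage'}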
Add a merge indicator column and then filter the missing data based on it; see the example below.
For more information see the merge documentation.
import pandas as pd

df1 = {'Transaction_Name': ['SC-001_Homepage', 'SC-002_Homepage', 'SC-003_Homepage', 'SC-001_Signinlink'], 'Count': [1, 0, 2, 1]}
df1 = pd.DataFrame(df1, columns=df1.keys())
df2 = {'Transaction_Name': ['SC-001_Homepage', 'SC-002_Homepage', 'SC-001_Signinlink', 'SC-002_Signinlink'], 'Count': [2, 1, 2, 1]}
df2 = pd.DataFrame(df2, columns=df2.keys())

# create a merged df with an indicator column (_merge)
merge_df = df1.merge(df2, on='Transaction_Name', how='outer', suffixes=['', '_'], indicator=True)

# filter rows which are missing in df2
missing_df2_rows = merge_df[merge_df['_merge'] == 'left_only'][df1.columns]
# filter rows which are missing in df1
missing_df1_rows = merge_df[merge_df['_merge'] == 'right_only'][df2.columns]

print(missing_df2_rows)
print(missing_df1_rows)
Output:
   Count Transaction_Name
2    2.0  SC-003_Homepage
   Count Transaction_Name
4    NaN SC-002_Signinlink
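Since the question only asks for rows missing from df2, the same indicator trick with how='left' is a slightly cheaper variation on the above:
left_merge = df1.merge(df2, on='Transaction_Name', how='left', suffixes=['', '_'], indicator=True)
print(left_merge[left_merge['_merge'] == 'left_only'][df1.columns])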

Merge Pandas DataFrames on Two Column Values Irrespective of Order in Row

Given two dataframes:
import pandas as pd

df1 = pd.DataFrame([
    ['Red', 'Blu', 1.1],
    ['Yel', 'Blu', 2.1],
    ['Grn', 'Grn', 3.1]], columns=['col_1a', 'col_1b', 'score_1'])
df2 = pd.DataFrame([
    ['Blu', 'Red', 1.2],
    ['Yel', 'Blu', 2.2],
    ['Vio', 'Vio', 3.2]], columns=['col_2a', 'col_2b', 'score_2'])
I want to merge them on two columns like below:
df3 = pd.DataFrame([
    ['Blu', 'Red', 1.1, 1.2],
    ['Yel', 'Blu', 2.1, 2.2],
], columns=['col_a', 'col_b', 'score_1', 'score_2'])
Caveat 1: The order of the values can switch between the dataframes to merge. The first row, for example, should be merged because it contains both 'Red' and 'Blu', even though they appear in different columns.
Caveat 2: The order of columns in the final df3 is unimportant. Whether 'Blu' ends up in col_a or col_b doesn't mean anything.
Caveat 3: Anything else not matching, like the last row, is ignored.
You can sort the first two columns along the rows, then merge on them:
import numpy as np

# rename the columns so both frames share the same key column names
cols = ['col_a', 'col_b']
df1.columns = cols + ['score_1']
df2.columns = cols + ['score_2']

# sort the two id columns along each row (pd.np is removed in pandas 2.0; use numpy directly)
df1[cols] = np.sort(df1[cols], axis=1)
df2[cols] = np.sort(df2[cols], axis=1)

# merge on the now order-insensitive key columns
df1.merge(df2)
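For the sample frames the merge should yield the two matching rows, while the unpaired Grn/Grn and Vio/Vio rows drop out; the expected result:
df1.merge(df2)
#   col_a col_b  score_1  score_2
# 0   Blu   Red      1.1      1.2
# 1   Blu   Yel      2.1      2.2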
