I have two dataframes and want to check whether they contain the same data.
df1:
import pandas as pd

df1 = [['tom', 10], ['nick', 15], ['juli', 14]]
df1 = pd.DataFrame(df1, columns=['Name', 'Age'])
df2:
df2 = [['nick', 15],['tom', 10], ['juli',14]]
df2 = pd.DataFrame(df2, columns = ['Name', 'Age'])
Note that the information in both is exactly the same; the only difference is the row order.
I've written code to compare the two dataframes, but it reports that they differ in the first two rows:
import numpy as np

ne = (df1 != df2).any(axis=1)
ne_stacked = (df1 != df2).stack()
changed = ne_stacked[ne_stacked]
changed.index.names = ['id', 'col']
difference_locations = np.where(df1 != df2)
changed_from = df1.values[difference_locations]
changed_to = df2.values[difference_locations]
divergences = pd.DataFrame({'df1': changed_from, 'df2': changed_to}, index=changed.index)
print(divergences)
I am receiving the below result:
          df1   df2
id col
0  Name   tom  nick
   Age     10    15
1  Name  nick   tom
   Age     15    10
I was expecting to receive:
Empty DataFrame
Columns: [df1, df2]
Index: []
How do I change the code so that it checks whether the rows of the two dataframes match, regardless of order?
And what if I were comparing two dataframes with different numbers of rows?
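One way to make the comparison order-insensitive (a sketch, using the sample frames above) is to sort both frames on all columns and reset the index before comparing; an outer merge with indicator=True also covers frames with different numbers of rows:
df1_sorted = df1.sort_values(list(df1.columns)).reset_index(drop=True)
df2_sorted = df2.sort_values(list(df2.columns)).reset_index(drop=True)
print(df1_sorted.equals(df2_sorted))  # True: same rows, different order

# Rows that appear in only one of the two frames (works for unequal lengths too)
diff = df1.merge(df2, how='outer', indicator=True).query("_merge != 'both'")
print(diff)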
When I merge two dataframes, the overlapping columns from the left and right dataframes are kept
with _x and _y suffixes appended.
Instead, I want a single column that 'merges' the values of the two columns such that:
when the values are the same it just puts that one value
when the values are different it keeps the value based on another column called 'date'
and takes the value which is the 'latest' based on the date.
I also tried concat; in that case it does 'merge' the two columns, but it just appends the rows of the two frames.
In the code below, for example, I would like to get the dataframe df_desired as output. How can I get that?
import pandas as pd
import numpy as np
np.random.seed(30)
company1 = ('comA','comB','comC','comD')
df1 = pd.DataFrame(columns=None)
df1['company'] = company1
df1['clv']=[100,200,300,400]
df1['date'] = [20191231,20191231,20191001,20190931]
print("\ndf1:")
print(df1)
company2 = ('comC','comD','comE','comF')
df2 = pd.DataFrame(columns=None)
df2['company'] = company2
df2['clv']=[300,450,500,600]
df2['date'] = [20191231,20191231,20191231,20191231]
print("\ndf2:")
print(df2)
df_desired = pd.DataFrame(columns=None)
df_desired['company'] = ('comA','comB','comC','comD','comE','comF')
df_desired['clv']=[100,200,300,450,500,600]
df_desired['date'] = [20191231,20191231,20191231,20191231,20191231,20191231]
print("\ndf_desired:")
print(df_desired)
df_merge = pd.merge(df1, df2, on='company', how='outer')
print("\ndf_merge:")
print(df_merge)
# alternately
df_concat = pd.concat([df1, df2], ignore_index=True, sort=False)
print("\ndf_concat:")
print(df_concat)
One approach is to concat the two dataframes, then sort the result on date in ascending order, and drop the duplicate entries (keeping the latest entry) per company:
df = pd.concat([df1, df2])
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')
df = df.sort_values('date', na_position='first').drop_duplicates('company', keep='last', ignore_index=True)
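Note: errors='coerce' turns the invalid date 20190931 into NaT, and na_position='first' sorts the NaT rows to the front, so drop_duplicates(keep='last') prefers any row that has a valid date.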
Result:
company clv date
0 comA 100 2019-12-31
1 comB 200 2019-12-31
2 comC 300 2019-12-31
3 comD 450 2019-12-31
4 comE 500 2019-12-31
5 comF 600 2019-12-31
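An equivalent alternative (just a sketch, using the same frames) keeps, for each company, the row whose parsed date is the most recent; it assumes every company has at least one valid date:
df = pd.concat([df1, df2], ignore_index=True)
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')
result = df.loc[df.groupby('company')['date'].idxmax()].reset_index(drop=True)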
I'll try to keep this short and to the point (with simplified data). I have a table of data with four columns (keep in mind more columns may be added later), none of which is unique on its own, but the three columns 'ID', 'ID2', 'DO' together must be unique as a group. I will bring this table into one dataframe, and the updated version of the table into another dataframe.
If df is the 'original data' and df2 is the 'updated data', is this the most accurate/efficient way to find what changes occur to the original data?
import pandas as pd
#Sample Data:
df = pd.DataFrame({'ID':[546,107,478,546,478], 'ID2':['AUSER','BUSER','CUSER','AUSER','EUSER'], 'DO':[3,6,8,4,6], 'DATA':['ORIG','ORIG','ORIG','ORIG','ORIG']})
df2 = pd.DataFrame({'ID':[107,546,123,546,123], 'ID2':['BUSER','AUSER','DUSER','AUSER','FUSER'], 'DO':[6,3,2,4,3], 'DATA':['CHANGE','CHANGE','CHANGE','ORIG','CHANGE']})
>>> df
DATA DO ID ID2
0 ORIG 3 546 AUSER
1 ORIG 6 107 BUSER
2 ORIG 8 478 CUSER
3 ORIG 4 546 AUSER
4 ORIG 6 478 EUSER
>>> df2
DATA DO ID ID2
0 CHANGE 6 107 BUSER
1 CHANGE 3 546 AUSER
2 CHANGE 2 123 DUSER
3 ORIG 4 546 AUSER
4 CHANGE 3 123 FUSER
#Compare Dataframes
merged = df2.merge(df, indicator=True, how='outer')
#Split the merged comparison into:
# - original records that will be updated or deleted
# - new records that will be inserted or update the original record.
df_original = merged.loc[merged['_merge'] == 'right_only'].drop(columns=['_merge']).copy()
df_new = merged.loc[merged['_merge'] == 'left_only'].drop(columns=['_merge']).copy()
#Create another merge to determine whether the new records are updates or inserts
check = pd.merge(df_new, df_original, how='left', on=['ID', 'ID2', 'DO'], indicator=True)
in_temp = check[['ID','ID2','DO']].loc[check['_merge']=='left_only']
upd_temp = check[['ID','ID2','DO']].loc[check['_merge']=='both']
#Create dataframes for each Transaction:
# - removals: Remove records based on provided key values
# - updates: Update entire record based on key values
# - inserts: Insert entire record
#df_new's keys are added twice so drop_duplicates(keep=False) removes every key present in the update, leaving only original keys that disappeared
removals = pd.concat([df_original[['ID','ID2','DO']],
                      df_new[['ID','ID2','DO']],
                      df_new[['ID','ID2','DO']]]).drop_duplicates(keep=False)
updates = df2.loc[(df2['ID'].isin(upd_temp['ID']))&(df2['ID2'].isin(upd_temp['ID2']))&(df2['DO'].isin(upd_temp['DO']))].copy()
inserts = df2.loc[(df2['ID'].isin(in_temp['ID']))&(df2['ID2'].isin(in_temp['ID2']))&(df2['DO'].isin(in_temp['DO']))].copy()
results:
>>> removals
ID ID2 DO
6 478 CUSER 8
8 478 EUSER 6
>>> updates
DATA DO ID ID2
0 CHANGE 6 107 BUSER
1 CHANGE 3 546 AUSER
>>> inserts
DATA DO ID ID2
2 CHANGE 2 123 DUSER
4 CHANGE 3 123 FUSER
To restate the questions: will this logic consistently and correctly identify the differences between two dataframes given the specified key columns? And is there a more efficient or more pythonic approach?
Updated Sample Data with more records and the corresponding results.
import pandas as pd
#Sample Data:
df = pd.DataFrame({'ID':[546,107,478,546], 'ID2':['AUSER','BUSER','CUSER','AUSER'], 'DO':[3,6,8,4], 'DATA':['ORIG','ORIG','ORIG','ORIG']})
df2 = pd.DataFrame({'ID':[107,546,123,546], 'ID2':['BUSER','AUSER','DUSER','AUSER'], 'DO':[6,3,2,4], 'DATA':['CHANGE','CHANGE','CHANGE','ORIG']})
For changed:
#Concat df and df2; whenever the same row appears in both, drop both copies
df3 = pd.concat([df, df2]).drop_duplicates(keep=False)
#Whenever the size of the following groupby is 2, the key exists in both frames but the rows differ, i.e. a change
#Change
df3 = df3.groupby(['ID', 'ID2', 'DO'])['DATA']\
.size()\
.reset_index()\
.query('DATA == 2')
df3.loc[:, 'DATA'] = 'CHANGE'
ID ID2 DO DATA
0 107 BUSER 6 CHANGE
3 546 AUSER 3 CHANGE
For Inserts:
#We can compare the key columns of df and df2 to see what's new in df2
#Inserts
df2[(~df2['ID'].isin(df['ID'])) &
    (~df2['ID2'].isin(df['ID2'])) &
    (~df2['DO'].isin(df['DO']))]
ID ID2 DO DATA
2 123 DUSER 2 CHANGE
For Removals:
#Similar logic as above, but flipped
#Removals
df[(~df['ID'].isin(df2['ID'])) &
   (~df['ID2'].isin(df2['ID2'])) &
   (~df['DO'].isin(df2['DO']))]
ID ID2 DO DATA
2 478 CUSER 8 ORIG
EDIT
df = pd.DataFrame({'ID':[546,107,478,546,478], 'ID2':['AUSER','BUSER','CUSER','AUSER','EUSER'], 'DO':[3,6,8,4,6], 'DATA':['ORIG','ORIG','ORIG','ORIG','ORIG']})
df2 = pd.DataFrame({'ID':[107,546,123,546,123], 'ID2':['BUSER','AUSER','DUSER','AUSER','FUSER'], 'DO':[6,3,2,4,3], 'DATA':['CHANGE','CHANGE','CHANGE','ORIG','CHANGE']})
New dataframes. For changed, we do it exactly the same way:
df3 = pd.concat([df, df2]).drop_duplicates(keep = False)
#Change
Change = df3.groupby(['ID', 'ID2', 'DO'])['DATA']\
.size()\
.reset_index()\
.query('DATA == 2')
Change.loc[:, 'DATA'] = 'CHANGE'
ID ID2 DO DATA
0 107 BUSER 6 CHANGE
5 546 AUSER 3 CHANGE
For inserts/removals we do the same groupby as above, except we query for the keys that appear only once. Then we follow up with an inner join against df and df2 to see what has been added or removed.
InsertRemove = df3.groupby(['ID', 'ID2', 'DO'])['DATA']\
.size()\
.reset_index()\
.query('DATA == 1')
#Inserts
Inserts = InsertRemove.merge(df2, how='inner', on=['ID', 'ID2', 'DO'])\
                      .drop('DATA_x', axis=1)\
                      .rename({'DATA_y': 'DATA'}, axis=1)
ID ID2 DO DATA
0 123 DUSER 2 CHANGE
1 123 FUSER 3 CHANGE
#Removals
Remove = InsertRemove.merge(df, how='inner', on=['ID', 'ID2', 'DO'])\
                     .drop('DATA_x', axis=1)\
                     .rename({'DATA_y': 'DATA'}, axis=1)
ID ID2 DO DATA
0 478 CUSER 8 ORIG
1 478 EUSER 6 ORIG
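One caveat on the per-column isin filters (used both in the question's updates/inserts selection and in the first Inserts/Removals attempt above): they test each column's membership independently, so a row whose ID, ID2 and DO values each occur somewhere in the other frame, but never together in one row, can be misclassified. A tuple-safe variant (a sketch, using the same frames and key columns) matches the full key with a keyed merge instead:
key = ['ID', 'ID2', 'DO']
#Rows of df2 whose full (ID, ID2, DO) key does not occur anywhere in df
inserts = df2.merge(df[key].drop_duplicates(), on=key, how='left', indicator=True)\
             .query("_merge == 'left_only'").drop(columns='_merge')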
Say I have two data frames:
df1:
A
0 a
1 b
df2:
A
0 a
1 c
I want the result to be the union of the two frames, with an extra column showing the source dataframe each row belongs to. In the case of duplicates, the duplicate should be removed, and its extra column should show both sources:
A B
0 a df1, df2
1 b df1
2 c df2
I can get the concatenated data frame (df3) without duplicates as follows:
import pandas as pd
df3=pd.concat([df1,df2],ignore_index=True).drop_duplicates().reset_index(drop=True)
I can't think of or find a method that gives me control over which element goes where. How can I add the extra column?
Thank you very much for any tips.
Merge with an indicator argument, and remap the result:
m = {'left_only': 'df1', 'right_only': 'df2', 'both': 'df1, df2'}
result = df1.merge(df2, on=['A'], how='outer', indicator='B')
result['B'] = result['B'].map(m)
result
A B
0 a df1, df2
1 b df1
2 c df2
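Passing a string to indicator (here 'B') names the indicator column directly, so no separate rename is needed; map then replaces the categorical flags with the desired source labels.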
Use the command below:
df3 = pd.concat([df1.assign(source='df1'), df2.assign(source='df2')]) \
.groupby('A') \
.aggregate(list) \
.reset_index()
The result will be:
A source
0 a [df1, df2]
1 b [df1]
2 c [df2]
assign adds a source column (with value 'df1' or 'df2') to each dataframe. groupby collects rows with the same A value into a single group, and aggregate describes how to combine the other columns (here, source) within each group. Using the list aggregation makes source the list of frames in which each A value appears.
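If you want the comma-separated string from the question rather than a list, a small follow-up (assuming the df3 above) is to join each list:
df3['source'] = df3['source'].str.join(', ')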
We can use an outer join to solve this:
df1 = pd.DataFrame({'A':['a','b']})
df2 = pd.DataFrame({'A':['a','c']})
df1['col1']='df1'
df2['col2']='df2'
df=pd.merge(df1, df2, on=['A'], how="outer").fillna('')
df['B']=df['col1']+','+df['col2']
df['B'] = df['B'].str.strip(',')
df=df[['A','B']]
df
A B
0 a df1,df2
1 b df1
2 c df2
I'm trying to merge two dataframes.
I want to merge on one key that is the index of the second DataFrame and
another key that is a column of the second DataFrame. The key names differ between the two DataFrames.
Example:
import pandas as pd
df2 = pd.DataFrame([(i, 'ABCDEFGHJKL'[j], i * 2 + j)
                    for i in range(10)
                    for j in range(10)],
                   columns=['Index', 'Sub', 'Value']).set_index('Index')
df1 = pd.DataFrame([['SOMEKEY-A', 0, 'A', 'MORE'],
                    ['SOMEKEY-B', 4, 'C', 'MORE'],
                    ['SOMEKEY-C', 7, 'A', 'MORE'],
                    ['SOMEKEY-D', 5, 'Z', 'MORE']],
                   columns=['key', 'Ext. Index', 'Ext. Sub', 'Description']
                  ).set_index('key')
df1 prints out
key Ext. Index Ext. Sub Description
SOMEKEY-A 0 A MORE
SOMEKEY-B 4 C MORE
SOMEKEY-C 7 A MORE
SOMEKEY-D 5 Z MORE
The first lines of df2 are:
Index Sub Value
0 A 0
0 B 1
0 C 2
0 D 3
0 E 4
I want to merge "Ext. Index" and "Ext. Sub" with DataFrame df2, where the index is "Index" and the column is "Sub"
The expected result is:
key Ext. Index Ext. Sub Description Ext. Value
SOMEKEY-A 0 A MORE 0
SOMEKEY-B 4 C MORE 10
SOMEKEY-C 7 A MORE 14
SOMEKEY-D 5 Z MORE None
Manually, the merge works like this:
def get_value(x):
    try:
        return df2[(df2.Sub == x['Ext. Sub']) &
                   (df2.index == x['Ext. Index'])]['Value'].iloc[0]
    except IndexError:
        return None

df1['Ext. Value'] = df1.apply(get_value, axis=1)
Can I do this with a pd.merge or pd.concat command, without changing df2 by turning df2.index into a column?
Try using:
df_new = (df1.merge(df2[['Sub', 'Value']],
                    how='left',
                    left_on=['Ext. Index', 'Ext. Sub'],
                    right_on=[df2.index, 'Sub'])
             .set_index(df1.index)
             .drop('Sub', axis=1))
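If you also want the merged column to be named "Ext. Value" as in the expected output, a small follow-up (assuming the df_new above) is:
df_new = df_new.rename(columns={'Value': 'Ext. Value'})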