Most efficient way to find what rows differ in pandas dataframe? - python

How do I find the most efficient way to check which rows differ in a pandas dataframe?
Imagine we have the following pandas dataframes, df1 and df2.
df1 = pd.DataFrame([['a','b'],['c','d'],['e','f']], columns=['First', 'Last'])
df2 = pd.DataFrame([['a','b'],['e','f'],['g','h']], columns=['First', 'Last'])
In this case, row index 0 of df1 would be ['a','b']; row index 1 of df1 would be ['c','d'], etc.
I want to know the most efficient way to find in which rows these dataframes differ.
In particular, although ['e','f'] appears in both dataframes, in df1 it is at index 2 and in df2 it is at index 1, so I would want my outcome to reflect this.
Something like diff(df1, df2) = [1, 2].
I know I could loop through all the rows and check df1.loc[i,:] == df2.loc[i,:] for i in range(len(df1)), but is there a more efficient way?

You may be looking for this:
df_diff = pd.concat([df1,df2]).drop_duplicates(keep=False)
From https://stackoverflow.com/a/57812527/15179457.
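If you specifically need the positional indices where the two frames disagree (the diff(df1, df2) = [1, 2] outcome above), a vectorized row-wise comparison is one sketch, assuming both frames have the same shape, index, and columns:
# flag every row that differs in at least one column
mismatch = (df1 != df2).any(axis=1)
diff_idx = list(df1.index[mismatch])
print(diff_idx)  # [1, 2]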

Related

Concatenating values into column from multiple rows

I have a dataframe containing only duplicate "MainID" rows. One MainID may have multiple secondary IDs (SecID). I want to concatenate the values of SecID for rows that share a MainID, joined by ':' in the SecID column. What is the best way of achieving this? Yes, I know this is not best practice, however it's the structure the software wants.
I need to keep the df structure and values in the rest of the df; they will always match the other duplicated row. Only SecID will be different.
Current:
data={'MainID':['NHFPL0580','NHFPL0580','NHFPL0582','NHFPL0582'],'SecID':['G12345','G67890','G11223','G34455'], 'Other':['A','A','B','B']}
df=pd.DataFrame(data)
print(df)
      MainID   SecID Other
0  NHFPL0580  G12345     A
1  NHFPL0580  G67890     A
2  NHFPL0582  G11223     B
3  NHFPL0582  G34455     B
Intended Structure
MainID      SecID          Other
NHFPL0580   G12345:G67890  A
NHFPL0582   G11223:G34455  B
Try:
df.groupby('MainID').apply(lambda x: ':'.join(x.SecID))
The above code returns a pd.Series; as #Guy suggested, you need .reset_index(name='SecID') if you want it back as a DataFrame.
The solution to the edited question:
df = df.groupby(['MainID', 'Other']).apply(lambda x: ':'.join(x.SecID)).reset_index(name='SecID')
You can then change the column order
cols = df.columns.tolist()
df = df[[cols[i] for i in [0, 2, 1]]]
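As a side note, the same result can be reached in one chained call; a sketch using agg instead of apply, assuming the same df as above:
df = df.groupby(['MainID', 'Other'], as_index=False)['SecID'].agg(':'.join)
df = df[['MainID', 'SecID', 'Other']]  # restore the original column order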

Best way to move an unexpected column in a Pandas DF to a new DF?

Wondering what the best way to tackle this issue is. If I have a DF with the following columns
df1
type_of_fruit name_of_fruit price
..... ..... .....
and a list called
expected_cols = ['name_of_fruit','price']
what's the best way to automate the check of df1 against the expected_cols list? I was trying something like
df_cols=df1.columns.values.tolist()
if df_cols != expected_cols:
And then try to drop to another df any columns not in expected_cols, but this doesn't seem like a great idea to me. Is there a way to save the "dropped" columns?
df2 = df1.drop(columns=expected_cols)
But then this seems problematic depending on column ordering, and also in cases where the dataframe could have either more columns than expected, or fewer than expected. In cases where there are fewer columns than expected (i.e. df1 only contains the column name_of_fruit), I'm planning on using
df1.reindex(columns=expected_cols)
But I'm a bit iffy on how to do this programmatically, and then how to handle the case where there are more columns than expected.
You can use set difference with -:
Assuming df1 has these columns:
In [539]: df1_cols = df1.columns  # ['type_of_fruit', 'name_of_fruit', 'price']
In [540]: expected_cols = ['name_of_fruit', 'price']
In [541]: unwanted_cols = list(set(df1_cols) - set(expected_cols))
In [542]: df2 = df1[unwanted_cols]
In [543]: df1.drop(columns=unwanted_cols, inplace=True)
Use groupby along the columns axis to split the DataFrame succinctly. In this case, check whether the columns are in your list to form the grouper, and you can store the results in a dict where the True key gets the DataFrame with the subset of columns in the list and the False key has the subset of columns not in the list.
Sample Data
import pandas as pd
df = pd.DataFrame(data=[[1, 2, 3]],
                  columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit','price']
Code
d = dict(tuple(df.groupby(df.columns.isin(expected_cols), axis=1)))
# If you need to ensure columns are always there then do
#d[True] = d[True].reindex(expected_cols)
d[True]
# name_of_fruit price
#0 2 3
d[False]
# type_of_fruit
#0 1
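The same split can also be done without groupby; a small sketch, assuming df and expected_cols as defined above:
keep = df.columns.intersection(expected_cols)
extra = df.columns.difference(expected_cols)
d = {True: df[keep], False: df[extra]}  # same result shape as the groupby version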

Compare content of two pandas dataframes even if the rows are differently ordered

I have two pandas dataframes whose rows are in different orders but which contain the same columns.
My goal is to easily compare the two dataframes and confirm that they both contain the same rows.
I have tried the "equals" function, but there seems to be something I am missing, because the results are not as expected:
df_1 = pd.DataFrame({1: [10,15,30], 2: [20,25,40]})
df_2 = pd.DataFrame({1: [30,10,15], 2: [40,20,25]})
df_1.equals(df_2)
I would expect that the outcome returns True, because both dataframes contain the same rows, just in a different order, but it returns False.
You can specify columns for sorting in DataFrame.sort_values - in my solution sorting by all columns and DataFrame.reset_index with drop=True for default indices in both DataFrames:
df11 = df_1.sort_values(by=df_1.columns.tolist()).reset_index(drop=True)
df21 = df_2.sort_values(by=df_2.columns.tolist()).reset_index(drop=True)
print (df11.equals(df21))
True
Try sorting and resetting the index:
df_1.sort_values(by=[1,2]).reset_index(drop=True).equals(df_2.sort_values(by=[1,2]).reset_index(drop=True))
Here is a solution for the situation where both the columns and the rows of two similar dataframes are differently ordered:
Order the columns
df_1 = df_1[df_2.columns]
Choose the column to sort the rows w.r.t.
ref_len = 0
for col in df_1.columns:
    if len(set(df_1[col])) > ref_len:
        final_col = col
        ref_len = len(set(df_1[col]))
Order the rows
df11 = df_1.sort_values(by=[final_col]).reset_index(drop=True)
df21 = df_2.sort_values(by=[final_col]).reset_index(drop=True)
df11.equals(df21)
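If you also want to see which rows are missing from which frame, a merge with indicator=True is another sketch (note that it matches rows by value, so duplicate rows within a frame are not counted strictly):
merged = df_1.merge(df_2, how='outer', indicator=True)
print(merged[merged['_merge'] != 'both'])  # rows present in only one of the frames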

Update dataframe according to another dataframe based on certain conditions

I have two dataframes, df1 and df2. df1 has columns A, B, C, D, E, F and df2 has A, B, J, D, E, K. I want to update the second dataframe with the rows of the first, but only when the first two columns have the same value in both dataframes. For each row where the following two conditions are true:
df1.A = df2.A
df1.B = df2.B
then update accordingly:
df2.D = df1.D
df2.E = df1.E
My dataframes have different number of rows.
When I tried this code I got a TypeError: cannot do positional indexing with these indexers of type 'str'.
for a in df1:
    for t in df2:
        if df1.iloc[a]['A'] == df2.iloc[t]['A'] and df1.iloc[a]['B'] == df2.iloc[t]['B']:
            df2.iloc[t]['D'] = df1.iloc[a]['D']
            df2.iloc[t]['E'] = df1.iloc[a]['E']
The Question:
You'd be better served merging the dataframes than doing nested iteration.
df2 = df2.merge(df1[['A', 'B', 'D', 'E']], on=['A', 'B'], how='left', suffixes=['_old', ''])
df2['D'] = df2['D'].fillna(df2['D_old'])
df2['E'] = df2['E'].fillna(df2['E_old'])
del df2['D_old']
del df2['E_old']
The first line attaches columns to df2 with values for columns D and E from the corresponding rows of df1, and renames the old columns.
The next two lines fill in the rows for which df1 had no matching row, and the next two delete the initial, now outdated versions of the columns.
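If every (A, B) pair occurs at most once in df1, an index-aligned DataFrame.update is another sketch that avoids the suffix cleanup:
df1k = df1.set_index(['A', 'B'])
df2k = df2.set_index(['A', 'B'])
df2k.update(df1k[['D', 'E']])  # overwrite D and E wherever the (A, B) key matches
df2 = df2k.reset_index()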
The Error:
Your TypeError happened because for a in df1: iterates over the columns of a dataframe, which are strings here, while .iloc only takes integers. Additionally, though you didn't get to this point, to set a value you'd need both index and column contained within the brackets.
So if you did need to set values by row, you'd want something more like
for a, _ in df1.iterrows():
    for t, _ in df2.iterrows():
        if df1.loc[a, 'A'] == ...
Though I'd strongly caution against doing that. If you find yourself thinking about it, there's probably either a much faster, less painful way to do it in pandas, or you're better off using another tool less focused on tabular data.

Pandas: best way to replicate df and fill with new values

Suppose I have df1:
dates = pd.date_range('20170101',periods=20)
df1 = pd.DataFrame(np.random.randint(10,size=(20,3)),index=dates,columns=['foo','bar','see'])
I would like to create df2 with the same shape, index and columns. I often find myself doing something like this:
df2 = pd.DataFrame(np.ones(df1.shape), index=df1.index, columns=df1.columns)
This is less than ideal. What's the pythonic way?
How about this:
df2 = df1.copy()
df2[:] = 1  # or any other value, for that matter
The last line is not even necessary if all you want is to preserve the shape and the row/column headers.
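If you only need the shape and the row/column labels, and not df1's data at all, a couple of further sketches:
df2 = pd.DataFrame(index=df1.index, columns=df1.columns)     # empty (all-NaN) frame with the same labels
df2 = pd.DataFrame(1, index=df1.index, columns=df1.columns)  # same labels, pre-filled with a scalar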
You can also use the dataframe method "where", which allows you to keep data based on a condition while preserving the shape/index of the original df.
dates = pd.date_range('20170101',periods=20)
df1 = pd.DataFrame(np.random.randint(10,size=(20,3)),index=dates,columns=['foo','bar','see'])
df2= df1.where(df1['foo'] % 2 == 0, 9999)
df2
