Compare two dataframes on multiple columns - python

I have two dataframes with the same columns. I want to compare them and, for each pair of rows that differ, find the column in which their values differ.
My dataframes are as follows (column A is a unique key that both dataframes share):
df1
A B C D E
0 V 10 5 18 20
1 W 9 18 11 13
2 X 8 7 12 5
3 Y 7 9 7 8
4 Z 6 5 3 90
df2
A B C D E
0 V 30 5 18 20
1 W 9 18 11 9
2 X 8 7 12 5
3 Y 36 9 7 8
4 Z 6 5 3 90
expected result:
df3
A key
0 V B
1 W E
3 Y B
What I've tried so far is:
df3 = df1.merge(df2, on=['A', 'B', 'C', 'D', 'E'], how='outer', indicator=True)
df3 = df3[df3._merge != 'both'] #to retrieve only the rows where there's a difference spotted
This is what I get for df3
A B C D E _merge
0 V 10 5 18 20 left_only
1 W 9 18 11 13 left_only
3 Y 7 9 7 8 left_only
5 V 30 5 18 20 right_only
6 W 9 18 11 9 right_only
8 Y 36 9 7 8 right_only
How can I achieve the expected result?

In your case, you can set A as the index on both frames first and then compare them with eq:
s = df1.set_index('A').eq(df2.set_index('A'))
s.mask(s).stack().reset_index()
Out[442]:
A level_1 0
0 V B False
1 W E False
2 Y B False
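To get exactly the two-column frame from the question (A and key), the boolean mask can be reshaped into long form. This is a minimal sketch that assumes the sample data from the question; the names neq, long_form and differs are illustrative and not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({"A": list("VWXYZ"), "B": [10, 9, 8, 7, 6], "C": [5, 18, 7, 9, 5],
                    "D": [18, 11, 12, 7, 3], "E": [20, 13, 5, 8, 90]})
df2 = pd.DataFrame({"A": list("VWXYZ"), "B": [30, 9, 8, 36, 6], "C": [5, 18, 7, 9, 5],
                    "D": [18, 11, 12, 7, 3], "E": [20, 9, 5, 8, 90]})

# True wherever the two frames disagree, aligned on the key column A
neq = df1.set_index("A").ne(df2.set_index("A"))

# One row per (A, column) pair, then keep only the pairs that differ
long_form = neq.reset_index().melt(id_vars="A", var_name="key", value_name="differs")
df3 = (long_form.loc[long_form["differs"], ["A", "key"]]
       .sort_values("A", kind="stable")
       .reset_index(drop=True))
print(df3)
```

If a row differs in several columns, this form lists each differing column on its own line.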

You can find the differences between the two frames and use idxmax with axis=1 to get the differing column:
diff = df1.set_index("A") - df2.set_index("A")
result = diff[diff.ne(0)].abs().idxmax(1).dropna()
>>> result
A
V B
W E
Y B
dtype: object
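Subtraction only works for numeric columns, and idxmax reports a single column per row. As an alternative that is not part of the answers above, pandas 1.1 and later provide DataFrame.compare, which keeps both the old and the new value. A minimal sketch, assuming the sample frames from the question:

```python
import pandas as pd

df1 = pd.DataFrame({"A": list("VWXYZ"), "B": [10, 9, 8, 7, 6], "C": [5, 18, 7, 9, 5],
                    "D": [18, 11, 12, 7, 3], "E": [20, 13, 5, 8, 90]})
df2 = pd.DataFrame({"A": list("VWXYZ"), "B": [30, 9, 8, 36, 6], "C": [5, 18, 7, 9, 5],
                    "D": [18, 11, 12, 7, 3], "E": [20, 9, 5, 8, 90]})

# Rows and columns that are identical in both frames are dropped.
# Each remaining column gets a second level: "self" (df1) and "other" (df2).
changes = df1.set_index("A").compare(df2.set_index("A"))
print(changes)
```

Cells that are equal in both frames show up as NaN in the result.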

Related

VLOOKUP on Python

I have got two dataframes:
df1:
Index a b c d e
1 1 X 10 12 A
2 1 Y 11 13 B
3 1 Z 12 14 C
4 1 W 13 15 C
5 1 A 14 49 D
df2:
Index b f
1 X YES
2 Y YES
3 Z YES
4 W YES
I would like to look up the values in column 'b' (as VLOOKUP does) and bring column 'f' into df1.
I tried running the following code, but it does not work:
new_df = df1.merge(df2, on='b', how='left')
My output should look as follows:
Index a b c d e f
1 1 X 10 12 A YES
2 1 Y 11 13 B YES
3 1 Z 12 14 C YES
4 1 W 13 15 C YES
5 1 A 14 49 D NaN
Note that df1 has 3400 rows, while df2 only 30.
You can also use a list comprehension:
import numpy as np
vlookup = ['Yes' if b in df2['b'].values else np.nan for b in df1['b']]
df1['vlookup'] = vlookup
Here is the output:
a b c d e vlookup
0 1 X 10 12 A Yes
1 1 Y 11 13 B Yes
2 1 Z 12 14 C Yes
3 1 W 13 15 C Yes
4 1 A 14 49 D NaN
You can use map with a pd.Series built from df2:
df1['f'] = df1['b'].map(df2.set_index('b')['f'])
df1
Output:
a b c d e f
Index
1 1 X 10 12 A YES
2 1 Y 11 13 B YES
3 1 Z 12 14 C YES
4 1 W 13 15 C YES
5 1 A 14 49 D NaN
First create a pd.Series using df2.set_index('b')['f'] then map the values in df1['b'] to create the column df1['f'].
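The lookup can be tried end to end with the sample data. The sketch below assumes the frames from the question (the Index column is left out) and adds a drop_duplicates call as a precaution, because map raises an error when the lookup Series has repeated index values. A left merge is shown next to it for comparison.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 1, 1, 1, 1], "b": list("XYZWA"),
                    "c": [10, 11, 12, 13, 14], "d": [12, 13, 14, 15, 49],
                    "e": list("ABCCD")})
df2 = pd.DataFrame({"b": list("XYZW"), "f": ["YES"] * 4})

# Lookup table: index = key column, values = column to bring over
lookup = df2.drop_duplicates("b").set_index("b")["f"]

# Keys missing from the lookup table become NaN, like a failed VLOOKUP
mapped = df1.assign(f=df1["b"].map(lookup))

# Equivalent left join; the rows of df1 are kept in their original order
merged = df1.merge(df2, on="b", how="left")
print(mapped)
```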

map values in a dataframe according to ranges

I have a dataframe df
import pandas
df = pandas.DataFrame(data=[1,2,3,2,2,2,3,3,4,5,10,11,12,1,2,1,1], columns=['codes'])
codes
0 1
1 2
2 3
3 2
4 2
5 2
6 3
7 3
8 4
9 5
10 10
11 11
12 12
13 1
14 2
15 1
16 1
and I would like to group the values in the column codes according to a specific logic:
values == 0 become A
values in the range (1,4) become B
values == 5 become C
values in the range (6,16) become D
Is there a way to keep the logic and the dataframe separate, so that it is easy to change the grouping rules in the future?
I would like to avoid writing
df.loc[df['codes']==0,'codes']='A'
df.loc[(df['codes']>=1) & (df['codes']<=4),'codes']='B'
The first idea is to use Series.map with merged dictionaries; the second is to use cut with right=False:
import pandas as pd
df = pd.DataFrame(data=[0,1,2,3,2,2,2,3,3,4,5,10,11,12,16,2,17,1], columns=['codes'])
d1 = {0: 'A', 5:'C'}
d2 = dict.fromkeys(range(1,5), 'B')
d3 = dict.fromkeys(range(6,17), 'D')
d = {**d1, **d2, **d3}
df['codes1'] = df['codes'].map(d)
df['codes2'] = pd.cut(df['codes'], bins=(0,1,5,6,17), labels=list('ABCD'), right=False)
print (df)
codes codes1 codes2
0 0 A A
1 1 B B
2 2 B B
3 3 B B
4 2 B B
5 2 B B
6 2 B B
7 3 B B
8 3 B B
9 4 B B
10 5 C C
11 10 D D
12 11 D D
13 12 D D
14 16 D D
15 2 B B
16 17 NaN NaN
17 1 B B
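To keep the grouping rules apart from the dataframe code, the rules can live in a plain list that is turned into the mapping dictionary. This is a sketch only: RULES and build_mapping are hypothetical names, and the ranges are assumed to be inclusive integer ranges, as in the question.

```python
import pandas as pd

# (low, high, label) with both ends inclusive; edit this list to change the grouping
RULES = [(0, 0, "A"), (1, 4, "B"), (5, 5, "C"), (6, 16, "D")]

def build_mapping(rules):
    """Expand inclusive integer ranges into a value -> label dictionary."""
    return {value: label
            for low, high, label in rules
            for value in range(low, high + 1)}

df = pd.DataFrame({"codes": [0, 1, 4, 5, 6, 16, 17]})

# Values not covered by any rule (17 here) map to NaN
df["group"] = df["codes"].map(build_mapping(RULES))
print(df)
```

For wide ranges or non-integer values, pd.cut with bins derived from the same list avoids building a large dictionary.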

How to drop duplicates in python if consecutive values are the same in two columns?

I have a dataframe like below:
A B C
1 8 23
2 8 22
3 9 45
4 9 45
5 6 12
6 4 10
7 11 12
I want to drop duplicates, keeping the first row of each run of consecutive equal values in B when the value in C is also the same.
E.g. here the value 9 in column B is repeated, and the corresponding value 45 in column C is repeated as well. In this case I want to retain the first occurrence.
Expected Output:
A B C
1 8 23
2 8 22
3 9 45
5 6 12
6 4 10
7 11 12
I tried a group by, but did not know how to drop the rows.
code:
df['consecutive'] = (df['B'] != df['B'].shift(1)).cumsum()
test=df.groupby('consecutive',as_index=False).apply(lambda x: (x['B'].head(1),x.shape[0],
x['C'].iloc[-1] - x['C'].iloc[0]))
This group by returns a series, but I want to drop the duplicate rows.
Add DataFrame.drop_duplicates by 2 columns:
df['consecutive'] = (df['B'] != df['B'].shift(1)).cumsum()
df = df.drop_duplicates(['consecutive','C'])
print (df)
A B C consecutive
0 1 8 23 1
1 2 8 22 1
2 3 9 45 2
4 5 6 12 3
5 6 4 10 4
6 7 11 12 5
Or chain both conditions with | for bitwise OR:
df = df[(df['B'] != df['B'].shift()) | (df['C'] != df['C'].shift())]
print (df)
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
An easy way is to take the row-to-row difference of B and C and drop the rows where both differences are 0 (duplicate values). The code is:
df[ ~((df.B.diff()==0) & (df.C.diff()==0)) ]
A one-liner to filter out such records is:
df[(df[['B', 'C']].shift() != df[['B', 'C']]).any(axis=1)]
Here we check whether the values in columns ['B', 'C'] are the same as in the previous (shifted) row; if they are not, we retain the row:
>>> df[(df[['B', 'C']].shift() != df[['B', 'C']]).any(axis=1)]
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
This is quite scalable, since we can define a function that will easily operate on an arbitrary number of values:
def drop_consecutive_duplicates(df, *colnames):
    dff = df[list(colnames)]
    return df[(dff.shift() != dff).any(axis=1)]
So you can then filter with:
drop_consecutive_duplicates(df, 'B', 'C')
Using diff, ne and any over axis=1:
Note: this method only works for numeric columns
m = df[['B', 'C']].diff().ne(0).any(axis=1)
print(df[m])
Output
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
Details
df[['B', 'C']].diff()
B C
0 NaN NaN
1 0.0 -1.0
2 1.0 23.0
3 0.0 0.0
4 -3.0 -33.0
5 -2.0 -2.0
6 7.0 2.0
Then we check if any of the values in a row are not equal (ne) to 0:
df[['B', 'C']].diff().ne(0).any(axis=1)
0 True
1 True
2 True
3 False
4 True
5 True
6 True
dtype: bool
You can compute a series of the rows to drop, and then drop them:
to_drop = (df['B'] == df['B'].shift())&(df['C']==df['C'].shift())
df = df[~to_drop]
It gives the expected result:
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
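Code (this drops every repeated (B, C) pair, not only consecutive repeats, so it matches the expected output only because the sample data has no non-consecutive duplicates)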
df1 = df.drop_duplicates(subset=['B', 'C'])
Result
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
If I understand your question correctly, given the following dataframe:
df = pd.DataFrame({'B': [8, 8, 9, 9, 6, 4, 11], 'C': [22, 23, 45, 45, 12, 10, 12],})
This one-liner handles the sample data using the drop_duplicates method (it removes all repeated (B, C) pairs, consecutive or not):
df.drop_duplicates(['B', 'C'])
It gives the expected result:
B C
0 8 22
1 8 23
2 9 45
4 6 12
5 4 10
6 11 12
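The shift-based answers and plain drop_duplicates agree on the sample data, but they differ as soon as a (B, C) pair comes back after a gap. A small sketch with made-up data (not from the question) to show the difference:

```python
import pandas as pd

# The pair (9, 45) appears in rows 0 and 1 (consecutive) and again in row 3
df = pd.DataFrame({"B": [9, 9, 6, 9], "C": [45, 45, 12, 45]})

keys = df[["B", "C"]]

# Keep a row when it differs from the previous row in B or in C
consecutive_only = df[keys.ne(keys.shift()).any(axis=1)]

# Keep only the first occurrence of each (B, C) pair anywhere in the frame
all_duplicates = df.drop_duplicates(["B", "C"])

print(consecutive_only.index.tolist())  # [0, 2, 3]
print(all_duplicates.index.tolist())    # [0, 2]
```

Row 3 survives the consecutive check because the row before it is different, while drop_duplicates removes it.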

Fill all values in a group with the first non-null value in that group

The following is the pandas dataframe I have:
cluster Value
1 A
1 NaN
1 NaN
1 NaN
1 NaN
2 NaN
2 NaN
2 B
2 NaN
3 NaN
3 NaN
3 C
3 NaN
4 NaN
4 S
4 NaN
5 NaN
5 A
5 NaN
5 NaN
Looking at the data, cluster 1 has the Value 'A' in one row and NaN in all the others. I want to fill 'A' into all the rows of cluster 1, and similarly for every other cluster: based on the one value present in a cluster, fill the remaining rows of that cluster. The output should look like this:
cluster Value
1 A
1 A
1 A
1 A
1 A
2 B
2 B
2 B
2 B
3 C
3 C
3 C
3 C
4 S
4 S
4 S
5 A
5 A
5 A
5 A
I am new to Python and not sure how to proceed. Can anybody help with this?
groupby + bfill, and ffill
df['Value'] = df.groupby('cluster')['Value'].transform(lambda s: s.bfill().ffill())
df
cluster Value
0 1 A
1 1 A
2 1 A
3 1 A
4 1 A
5 2 B
6 2 B
7 2 B
8 2 B
9 3 C
10 3 C
11 3 C
12 3 C
13 4 S
14 4 S
15 4 S
16 5 A
17 5 A
18 5 A
19 5 A
Or,
groupby + transform with first
df['Value'] = df.groupby('cluster').Value.transform('first')
df
cluster Value
0 1 A
1 1 A
2 1 A
3 1 A
4 1 A
5 2 B
6 2 B
7 2 B
8 2 B
9 3 C
10 3 C
11 3 C
12 3 C
13 4 S
14 4 S
15 4 S
16 5 A
17 5 A
18 5 A
19 5 A
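The transform approach can be checked on a small frame. The sketch below uses made-up data (shorter than the question's) and adds one cluster with no value at all, to show that such a cluster simply stays NaN. It also fills only the missing cells instead of overwriting the whole column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cluster": [1, 1, 2, 2, 2, 3],
    "Value": ["A", np.nan, np.nan, "B", np.nan, np.nan],  # cluster 3 has no value
})

# 'first' returns the first non-null value of each group, broadcast to every row
first_per_cluster = df.groupby("cluster")["Value"].transform("first")

# Fill only the gaps; existing values are left untouched
df["Value"] = df["Value"].fillna(first_per_cluster)
print(df)
```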
Edit
The following seems better:
nan_map = df.dropna().set_index('cluster').to_dict()['Value']
df['Value'] = df['cluster'].map(nan_map)
print(df)
Original
I can't think of a better way to do this than iterate over all the rows, but one might exist. First I built your DataFrame:
import pandas as pd
import math
# Build your DataFrame
# DataFrame.from_items was removed in pandas 1.0; a dict keeps the column order
df = pd.DataFrame({
    'cluster': [1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,5,5,5,5],
    'Value': [float('nan') for _ in range(20)],
})
df['Value'] = df['Value'].astype(object)
df.at[ 0,'Value'] = 'A'
df.at[ 7,'Value'] = 'B'
df.at[11,'Value'] = 'C'
df.at[14,'Value'] = 'S'
df.at[17,'Value'] = 'A'
Now here's an approach that first creates a nan_map dict, then sets the values in Value as specified in the dict.
# Create a dict to map clusters to unique values
nan_map = df.dropna().set_index('cluster').to_dict()['Value']
# nan_map: {1: 'A', 2: 'B', 3: 'C', 4: 'S', 5: 'A'}
# Apply
for i, row in df.iterrows():
    df.at[i,'Value'] = nan_map[row['cluster']]
print(df)
Output:
cluster Value
0 1 A
1 1 A
2 1 A
3 1 A
4 1 A
5 2 B
6 2 B
7 2 B
8 2 B
9 3 C
10 3 C
11 3 C
12 3 C
13 4 S
14 4 S
15 4 S
16 5 A
17 5 A
18 5 A
19 5 A
Note: This sets all values based on the cluster and doesn't check for NaN-ness. You may want to experiment with something like:
# Apply
for i, row in df.iterrows():
    if isinstance(df.at[i,'Value'], float) and math.isnan(df.at[i,'Value']):
        df.at[i,'Value'] = nan_map[row['cluster']]
to see which is more efficient (my guess is the former, without the checks).

compare multiple columns of pandas dataframe with one column

I have a dataframe:
df-
A B C D E
0 V 10 5 18 20
1 W 9 18 11 13
2 X 8 7 12 5
3 Y 7 9 7 8
4 Z 6 5 3 90
I want to add a column 'Result' which should return 1 if the value in column 'E' is greater than the values in B, C & D columns else return 0.
Output should be:
A B C D E Result
0 V 10 5 18 20 1
1 W 9 18 11 13 0
2 X 8 7 12 5 0
3 Y 7 9 7 8 0
4 Z 6 5 3 90 1
For a few columns, I would use logic like: if(and(E>B,E>C,E>D),1,0).
But I have to compare around 20 columns (from B to U) with the column named 'V'. Additionally, the dataframe has around 100 thousand rows.
I am using
df['Result']=np.where((df.ix[:,1:20]<df['V']).all(1),1,0)
and it gives a memory error.
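One possible solution is to compare in NumPy and then convert the boolean mask to ints. (The memory error most likely arises because comparing a DataFrame with a Series using < aligns the Series index with the DataFrame's columns instead of comparing row by row, which produces a very large intermediate frame.)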
df['Result'] = (df.iloc[:, 1:4].values < df[['E']].values).all(axis=1).astype(int)
print (df)
A B C D E Result
0 V 10 5 18 20 1
1 W 9 18 11 13 0
2 X 8 7 12 5 0
3 Y 7 9 7 8 0
4 Z 6 5 3 90 1
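The same row-wise comparison can also be written with pandas alone by passing axis=0 to lt, which aligns the Series with the rows. A minimal sketch, assuming the sample frame from the question; the list cols is illustrative and would hold the 20 column names in the real data:

```python
import pandas as pd

df = pd.DataFrame({"A": list("VWXYZ"), "B": [10, 9, 8, 7, 6], "C": [5, 18, 7, 9, 5],
                   "D": [18, 11, 12, 7, 3], "E": [20, 13, 5, 8, 90]})

cols = ["B", "C", "D"]  # columns to compare against E

# axis=0 compares each row of df[cols] with the matching row of df["E"]
df["Result"] = df[cols].lt(df["E"], axis=0).all(axis=1).astype(int)

# Equivalent check: E beats every column exactly when it beats the row maximum
via_max = (df[cols].max(axis=1) < df["E"]).astype(int)
print(df)
```

The row-maximum form only needs one comparison per row, which may help with many columns.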
