Problem
Consider the following dataframe:
import pandas as pd

data_so = {
    'ID': [100, 100, 100, 200, 200, 300, 300, 300],
    'letter': ['A', 'B', 'A', 'C', 'D', 'E', 'D', 'A'],
}
df_so = pd.DataFrame(data_so, columns=['ID', 'letter'])
I want to obtain a new column where all duplicates in different groups are True. All other duplicates in the same group should be False.
What I've tried
I've tried using
df_so['dup'] = df_so.duplicated(subset=['letter'], keep=False)
but the result is not what I want:
The first occurrence of A in group 100 (row 0) should be True because A also appears in another group (row 7). However, all further occurrences of A within the same group (row 2) should be False.
If row 7 were deleted, then row 0 should become False, because A would no longer appear in any other group.
What you need is essentially the AND of two different duplicated() calls.
~df_so.duplicated() deals with duplicates within groups (it is True only for the first occurrence of each (ID, letter) pair)
df_so.drop_duplicates().duplicated(subset='letter', keep=False).fillna(True) deals with duplicates between groups, ignoring within-group duplicates
Code:
import pandas as pd

data_so = {
    'ID': [100, 100, 100, 200, 200, 300, 300, 300],
    'letter': ['A', 'B', 'A', 'C', 'D', 'E', 'D', 'A'],
}
df_so = pd.DataFrame(data_so, columns=['ID', 'letter'])

# Within-group mask AND between-group mask
df_so['dup'] = ~df_so.duplicated() & df_so.drop_duplicates().duplicated(
    subset='letter', keep=False).fillna(True)
print(df_so)
Output:
ID letter dup
0 100 A True
1 100 B False
2 100 A False
3 200 C False
4 200 D True
5 300 E False
6 300 D True
7 300 A True
Other case:
data_so = {
    'ID': [100, 100, 100, 200, 200, 300, 300],
    'letter': ['A', 'B', 'A', 'C', 'D', 'E', 'D'],
}
Output:
ID letter dup
0 100 A False
1 100 B False
2 100 A False
3 200 C False
4 200 D True
5 300 E False
6 300 D True
As you clarified in the comments, you need an additional mask besides the current duplicated one:
m1 = df_so.duplicated(subset=['letter'], keep=False)
m2 = ~df_so.groupby('ID').letter.apply(lambda x: x.duplicated())
df_so['dup'] = m1 & m2
Out[157]:
ID letter dup
0 100 A True
1 100 B False
2 100 A False
3 200 C False
4 200 D True
5 300 E False
6 300 D True
7 300 A True
8 300 A False
Note: I added row 8, as described in the comments.
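For reproducibility, the data behind this output presumably had one extra (300, A) row appended; this reconstruction is an assumption based on the note above:

# Assumed data including the extra row 8 from the comments
data_so = {
    'ID': [100, 100, 100, 200, 200, 300, 300, 300, 300],
    'letter': ['A', 'B', 'A', 'C', 'D', 'E', 'D', 'A', 'A'],
}
df_so = pd.DataFrame(data_so)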
My idea for this problem:
import datatable as dt
df = dt.Frame(df_so)
df[:1, :, dt.by("ID", "letter")]
I would group by both the ID and letter columns, then simply select the first row of each group.
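For readers without datatable installed, a rough pandas equivalent of this group-and-take-first idea might be (a sketch, not part of the original answer):

# Keep only the first row of each (ID, letter) group
deduped = df_so.drop_duplicates(subset=['ID', 'letter'])
print(deduped)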
Related
I have a sample data set which is similar to the one defined below.
import pandas as pd

dict_1 = {'Id': [1, 1, 2, 2, 3, 4],
          'boolean_val': [True, False, True, False, True, False],
          'sal': [1000, 2000, 1500, 2500, 3500, 4500]}
test = pd.DataFrame(dict_1)
test.head(10)
I have to create 2 new columns in test dataframe i.e. output_True & output_False based on given conditions:
a) If Id[0] == Id[1] and boolean_val == True, then put sal[0] (the value where boolean_val is True) in output_True, else "NA".
b) If Id[0] == Id[1] and boolean_val == False, then put sal[1] (the value where boolean_val is False) in output_False, else "NA".
c) If Id[0] != Id[1] and boolean_val == True, then put that row's sal value in output_True; else if Id[0] != Id[1] and boolean_val == False, put that row's sal value in output_False.
In case I have not framed my question properly, please check the dataframe below; I want my output to be similar to the output_True and output_False columns shown there.
dict_1 = {'Id' : [1, 1, 2, 2, 3, 4],
'boolean_val' : [True, False, True, False, True, False],
"sal" : [1000, 2000, 1500, 2500, 3500, 4500],
"output_True" : [1000, "NA", 1500, "NA", 3500, "NA"],
"output_False" : [2000, "NA", 2500, "NA", "NA", 4500]}
output_df = pd.DataFrame(dict_1)
output_df.head(10)
I have tried using np.where() & list comprehension but my output data is not showing me correct value. Can someone please help me with this?
Use loc to assign your values based on the boolean column. For the second condition you can use .shift() to compare consecutive Id values (Id[0] == Id[1]) and sum based on that:
dict_1 = {'Id' : [1, 1, 2, 2, 3, 4],
'boolean_val' : [True, False, True, False, True, False],
"sal" : [1000, 2000, 1500, 2500, 3500, 4500]}
test = pd.DataFrame(dict_1)
test
Id boolean_val sal
0 1 True 1000
1 1 False 2000
2 2 True 1500
3 2 False 2500
4 3 True 3500
5 4 False 4500
import numpy as np

# Where boolean_val is True, copy sal into output_True
cond1 = test.boolean_val
test.loc[cond1, 'output_True'] = test.sal

# Where the next row has the same Id, fill output_False with sal + output_True
cond2 = test.Id.shift(-1).eq(test.Id)
test['output_False'] = np.nan
test.loc[cond2, 'output_False'] = test['sal'] + test['output_True']
test
Id boolean_val sal output_True output_False
0 1 True 1000 1000.0 2000.0
1 1 False 2000 NaN NaN
2 2 True 1500 1500.0 3000.0
3 2 False 2500 NaN NaN
4 3 True 3500 3500.0 NaN
5 4 False 4500 NaN NaN
Here's a way to get your desired output:
import numpy as np

df = test.pivot(index='Id', columns='boolean_val', values='sal')
df = df.assign(boolean_val=df.loc[:, True].notna()).set_index('boolean_val', append=True)
df = df.rename(columns={True: 'output_True', False: 'output_False'})[['output_True', 'output_False']]
output_df = test.join(df, on=['Id', 'boolean_val'])
for col in ('output_True', 'output_False'):
    output_df[col] = np.where(output_df[col].isna(), "NA",
                              output_df[col].astype(pd.Int64Dtype()))
Output:
Id boolean_val sal output_True output_False
0 1 True 1000 1000 2000
1 1 False 2000 NA NA
2 2 True 1500 1500 2500
3 2 False 2500 NA NA
4 3 True 3500 3500 NA
5 4 False 4500 NA 4500
Explanation:
use pivot() to create an intermediate dataframe df with True and False columns containing the corresponding sal values for each Id
add a boolean_val column which contains True unless a given row's True column is NaN
set Id, boolean_val as the index for df
rename the True and False columns as output_True and output_False and swap their positions (to match the desired output)
use join() to create output_df, which is test with the added columns output_True and output_False
replace NaN with the string "NA" and change sal values from float to int in output_True and output_False.
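As an aside, that final NaN-to-"NA" step could also be written with a plain map, which skips the np.where/astype round trip (a sketch assuming the same output_df as above):

# Map NaN to the string "NA" and everything else to a plain int
for col in ('output_True', 'output_False'):
    output_df[col] = output_df[col].map(lambda v: "NA" if pd.isna(v) else int(v))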
I have data similar to this
import pandas as pd

data = {'A': [10, 20, 30, 10, -10],
        'B': [100, 200, 300, 100, -100],
        'C': [1000, 2000, 3000, 1000, -1000]}
df = pd.DataFrame(data)
df
Index    A     B      C
0       10   100   1000
1       20   200   2000
2       30   300   3000
3       10   100   1000
4      -10  -100  -1000
Here rows 0 and 3 are exactly equal, and row 4 has the same values but negative. For such scenarios I want a fourth column D populated with the value 'Exact opposite' for rows 3 and 4 (any one of the matching rows).
Similar to:

Index    A     B      C  D
0       10   100   1000
1       20   200   2000
2       30   300   3000
3       10   100   1000  Exact opposite
4      -10  -100  -1000  Exact opposite
One approach I can think of is adding a column that sums the values of all the columns:
column_names=['A','B','C']
df['Sum Val'] = df[column_names].sum(axis=1)
Index    A     B      C  Sum Val
0       10   100   1000     1110
1       20   200   2000     2220
2       30   300   3000     3330
3       10   100   1000     1110
4      -10  -100  -1000    -1110
and then check whether there are any negative values and find the corresponding equal positive value, but I could not proceed from there. Please pardon any mistakes. (One hedged way to finish this idea is sketched below, before the answers.)
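A minimal sketch completing the 'Sum Val' idea, assuming the df and column_names defined above: flag every row whose row-sum also occurs negated somewhere in the column (note this marks both members of a duplicate pair, not just one):

import numpy as np

# A row is an "exact opposite" candidate if the negation of its row-sum
# appears anywhere among the row-sums
totals = df[column_names].sum(axis=1)
df['D'] = np.where(totals.isin(-totals), 'Exact opposite', '')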
Like this maybe:
import numpy as np

# Create column 'D': mark rows that duplicate another row in absolute value
ix = df[['A', 'B', 'C']].abs().duplicated(keep=False)
df['D'] = np.where(ix, 'Duplicate', '')

# If the sum of the duplicated rows is 0, they are exact opposites
if df.loc[ix, ['A', 'B', 'C']].sum(1).sum() == 0:
    df.loc[ix, 'D'] = 'Exact Opposite'
Output:
A B C D
0 10 100 1000 Exact Opposite
1 20 200 2000
2 30 300 3000
3 -10 -100 -1000 Exact Opposite
To follow your logic, let us just add abs with groupby, so the output returns the paired indexes as lists:
df.reset_index().groupby(df['Sum Val'].abs())['index'].agg(list)
Out[367]:
Sum Val
1110 [0, 3]
2220 [1]
3330 [2]
Name: index, dtype: object
import pandas as pd

data = {'A': [10, 20, 30, -10], 'B': [100, 200, 300, -100], 'C': [1000, 2000, 3000, -1000]}
df = pd.DataFrame(data)
print(df)

# A row is an "exact opposite" if another row's total is its negation
df['total'] = df.sum(axis=1)
df['total'] = df['total'].apply(lambda x: "Exact opposite" if sum(df['total'] == -1 * x) else "")
print(df)
Say I have a DataFrame like following:
df = pd.DataFrame({
'group':['A', 'A', 'A', 'B', 'B', 'C'],
'amount':[100, -100, 50, 30, -30, 40]
})
I would like to add another column that checks whether the amount in each row can be paired (i.e. same amount, but one positive and one negative) within its group.
For example, in group A, 100 & -100 can be paired, then they will be True, while 50 cannot find a pair, then it's False (like the following table).
group  amount   pair
A         100   True
A        -100   True
A          50   False
B          30   True
B         -30   True
C          40   False
What would be the most efficient way to do this?
We can take the abs of the amount column, then create the pair column based on DataFrame.duplicated:
df['pair'] = df.assign(amount=df['amount'].abs()).duplicated(keep=False)
keep=False means both rows of a duplicated pair get True. The right-hand side can also be given a subset if the DataFrame has more than these 2 columns, as sketched after the output below.
df:
group amount pair
0 A 100 True
1 A -100 True
2 A 50 False
3 B 30 True
4 B -30 True
5 C 40 False
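For instance, with extra columns present, the subset variant might look like this (a sketch; duplicated(subset=...) restricts the comparison to the named columns):

# Restrict duplicated() to the two relevant columns so extra columns
# do not affect the pairing
df['pair'] = df.assign(amount=df['amount'].abs()).duplicated(
    subset=['group', 'amount'], keep=False)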
Update to handle duplicate values, ensuring only positive/negative pairs get matched, using pivot_table:
Updated DataFrame:
df = pd.DataFrame({
'group': ['A', 'A', 'A', 'A', 'B', 'B', 'C'],
'amount': [100, -100, 50, 50, 30, -30, 40]
})
Pivot to wide form and check for pairs:
df['abs_amount'] = df['amount'].abs()
df = df.join(
df.pivot_table(index=['group', 'abs_amount'],
columns=df['amount'].gt(0),
values='amount',
aggfunc='first')
.notnull().all(axis=1)
.rename('pair'),
on=['group', 'abs_amount']
).drop('abs_amount', axis=1)
df:
group amount pair
0 A 100 True
1 A -100 True
2 A 50 False
3 A 50 False
4 B 30 True
5 B -30 True
6 C 40 False
The pivot_table:
df['abs_amount'] = df['amount'].abs()
df.pivot_table(index=['group', 'abs_amount'],
columns=df['amount'].gt(0),
values='amount',
aggfunc='first')
amount False True
group abs_amount
A 50 NaN 50.0 # Multiple 50s but no -50
100 -100.0 100.0
B 30 -30.0 30.0
C 40 NaN 40.0
Ensure all values in the row are non-null:
df.pivot_table(index=['group', 'abs_amount'],
columns=df['amount'].gt(0),
values='amount',
aggfunc='first').notnull().all(axis=1)
group abs_amount
A 50 False
100 True
B 30 True
C 40 False
dtype: bool
This is my dataframe:
df = pd.DataFrame({'sym': list('aaaaaabb'), 'key': [1, 1, 1, 1, 2, 2, 3, 3], 'x': [100, 100, 90, 100, 500, 500, 700, 700]})
I group them by key and sym:
groups = df.groupby(['key', 'sym'])
Now I want to check whether all x in each group are equal or not. If they are not equal, I want to drop that group from the df. In this case I want to omit the first group.
This is my desired df:
key sym x
4 2 a 500
5 2 a 500
6 3 b 700
7 3 b 700
Use GroupBy.transform with SeriesGroupBy.nunique, compare the result with 1, and filter by boolean indexing:
df1 = df[df.groupby(['key', 'sym'])['x'].transform('nunique').eq(1)]
print (df1)
sym key x
4 a 2 500
5 a 2 500
6 b 3 700
7 b 3 700
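A shorter but typically slower alternative (not from the original answer) uses GroupBy.filter, which drops whole groups failing a predicate:

# Keep only groups where all x values are identical
df1 = df.groupby(['key', 'sym']).filter(lambda g: g['x'].nunique() == 1)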
My DataFrame looks like this:
A B
100 1
100 2
200 2
200 3
I need to find all possible combinations of A and B values and create a new dataframe with these combinations plus a third column indicating each combination's presence in the original df:
A B C
100 1 True
100 2 True
100 3 False
200 1 False
200 2 True
200 3 True
How I'm doing it now:
import pandas as pd

df = pd.DataFrame({'A': [100, 100, 200, 200], 'B': [1, 2, 2, 3]})

# Cross join on a dummy key to get all A x B combinations
df['D'] = 42
df2 = df[['A', 'D']].merge(df[['B', 'D']], on='D')[['A', 'B']].drop_duplicates()

# Flag combinations present in the original df
i1 = df.set_index(['A', 'B']).index
i2 = df2.set_index(['A', 'B']).index
df2['C'] = i2.isin(i1)
print(df2)
It works, but looks ugly. Is there a cleaner way?
You can use:
create a new column filled with True
set_index from the columns, giving all existing combinations
create MultiIndex.from_product from the levels of the df1 index
reindex the original df, filling False where a combination does not exist
reset_index to turn the MultiIndex back into columns
df['C'] = True
df1 = df.set_index(['A','B'])
mux = pd.MultiIndex.from_product(df1.index.levels, names=df1.index.names)
df = df1.reindex(mux, fill_value=False).reset_index()
print (df)
A B C
0 100 1 True
1 100 2 True
2 100 3 False
3 200 1 False
4 200 2 True
5 200 3 True
With the help of itertools and tuples:
import itertools

# All A x B combinations via the cartesian product of the unique values
newdf = pd.DataFrame(list(itertools.product(df['A'].unique(), df['B'].unique())),
                     columns=df.columns)

# Check each combination's membership against the original rows (as tuples)
dft = list(df.itertuples(index=False))
newdf['C'] = newdf.apply(lambda x: tuple(x) in dft, axis=1)
Output :
A B C
0 100 1 True
1 100 2 True
2 100 3 False
3 200 1 False
4 200 2 True
5 200 3 True
Using cartesian_product and pd.merge
In [415]: combs = pd.core.reshape.util.cartesian_product(
df.set_index(['A', 'B']).index.levels)
In [416]: combs
Out[416]:
[array([100, 100, 100, 200, 200, 200], dtype=int64),
array([1, 2, 3, 1, 2, 3], dtype=int64)]
In [417]: (pd.DataFrame({'A': combs[0], 'B': combs[1]})
.merge(df, how='left', indicator='C')
.replace({'C': {'both': True, 'left_only': False}}) )
Out[417]:
A B C
0 100 1 True
1 100 2 True
2 100 3 False
3 200 1 False
4 200 2 True
5 200 3 True
For combs, you could also do:
In [432]: pd.core.reshape.util.cartesian_product([df.A.unique(), df.B.unique()])
Out[432]:
[array([100, 100, 100, 200, 200, 200], dtype=int64),
array([1, 2, 3, 1, 2, 3], dtype=int64)]
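Since pd.core.reshape.util.cartesian_product is a private utility that may move between pandas versions, a hedged alternative builds the combinations with the public MultiIndex.from_product and reuses the same merge-with-indicator trick:

# Public-API variant: all (A, B) combinations, then a left merge whose
# indicator column tells us which combinations exist in df
combs = pd.MultiIndex.from_product(
    [df['A'].unique(), df['B'].unique()], names=['A', 'B']).to_frame(index=False)
result = (combs.merge(df, how='left', indicator='C')
               .replace({'C': {'both': True, 'left_only': False}}))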