How to conditionally remove duplicates from a pandas dataframe - python

Consider the following dataframe
import numpy as np
import pandas as pd
df = pd.DataFrame({'A' : [1, 2, 3, 3, 4, 4, 5, 6, 7],
                   'B' : ['a','b','c','c','d','d','e','f','g'],
                   'Col_1' :[np.nan, 'A','A', np.nan, 'B', np.nan, 'B', np.nan, np.nan],
                   'Col_2' :[2,2,3,3,3,3,4,4,5]})
df
Out[92]:
A B Col_1 Col_2
0 1 a NaN 2
1 2 b A 2
2 3 c A 3
3 3 c NaN 3
4 4 d B 3
5 4 d NaN 3
6 5 e B 4
7 6 f NaN 4
8 7 g NaN 5
I want to remove all rows that are duplicates with regard to columns 'A' and 'B'. Of each set of duplicates, I want to remove the row that has a NaN entry (I know that for all duplicates there will be one NaN and one non-NaN entry). The end result should look like this:
A B Col_1 Col_2
0 1 a NaN 2
1 2 b A 2
2 3 c A 3
4 4 d B 3
6 5 e B 4
7 6 f NaN 4
8 7 g NaN 5
All efficient one-liners are most welcome.

If the goal is to only drop the NaN duplicates, a slightly more involved solution is needed.
First, sort on A, B, and Col_1; sort_values places NaNs last by default (na_position='last'), so within each group the NaN row moves to the bottom. Then call df.drop_duplicates with keep='first':
out = df.sort_values(['A', 'B', 'Col_1']).drop_duplicates(['A', 'B'], keep='first')
print(out)
A B Col_1 Col_2
0 1 a NaN 2
1 2 b A 2
2 3 c A 3
4 4 d B 3
6 5 e B 4
7 6 f NaN 4
8 7 g NaN 5
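As a minimal, self-contained sketch of the sort-then-deduplicate idea, the snippet below spells out na_position='last' (already the default) and adds sort_index() to restore the original row order. The sort_index() step is an addition for illustration and is not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 3, 4, 4, 5, 6, 7],
    'B': ['a', 'b', 'c', 'c', 'd', 'd', 'e', 'f', 'g'],
    'Col_1': [np.nan, 'A', 'A', np.nan, 'B', np.nan, 'B', np.nan, np.nan],
    'Col_2': [2, 2, 3, 3, 3, 3, 4, 4, 5],
})

out = (
    df.sort_values(['A', 'B', 'Col_1'], na_position='last')  # NaN rows sink within each (A, B) group
      .drop_duplicates(['A', 'B'], keep='first')             # keep the first, i.e. non-NaN, row per group
      .sort_index()                                          # back to the original row order
)
print(out)
```

Rows whose (A, B) pair is unique are kept even when Col_1 is NaN, because they have no duplicate to lose out to.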

Here's an alternative:
df[~((df[['A', 'B']].duplicated(keep=False)) & (df.isnull().any(axis=1)))]
# A B Col_1 Col_2
# 0 1 a NaN 2
# 1 2 b A 2
# 2 3 c A 3
# 4 4 d B 3
# 6 5 e B 4
# 7 6 f NaN 4
# 8 7 g NaN 5
This uses the bitwise "not" operator ~ to negate rows that meet the joint condition of being a duplicate row (the argument keep=False causes the method to evaluate to True for all non-unique rows) and containing at least one null value. So where the expression df[['A', 'B']].duplicated(keep=False) returns this Series:
# 0 False
# 1 False
# 2 True
# 3 True
# 4 True
# 5 True
# 6 False
# 7 False
# 8 False
...and the expression df.isnull().any(axis=1) returns this Series:
# 0 True
# 1 False
# 2 False
# 3 True
# 4 False
# 5 True
# 6 False
# 7 True
# 8 True
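... we wrap each in parentheses (in Python, & binds more tightly than comparisons such as != or ==, so conditions combined with & or | must be grouped explicitly), and then wrap the pair in parentheses again so that ~ negates the entire expression (i.e. ~( ... )). The line below also appends a further condition, (df['Col_2'] != 5), to show how extra filters chain on, which is why row 8 evaluates to False: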
~((df[['A','B']].duplicated(keep=False)) & (df.isnull().any(axis=1))) & (df['Col_2'] != 5)
# 0 True
# 1 True
# 2 True
# 3 False
# 4 True
# 5 False
# 6 True
# 7 True
# 8 False
You can build more complex conditions with further use of the logical operators & and | (the "or" operator). As with SQL, group your conditions as necessary with additional parentheses; for instance, filter based on the logic "both condition X AND condition Y are true, or condition Z is true" with df[ ( (X) & (Y) ) | (Z) ].
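The same mask can be easier to read when each condition gets its own name. This is only a sketch: the names is_dup and has_nan are illustrative, and it checks Col_1 alone for NaN rather than every column, which gives the same result on this data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 3, 4, 4, 5, 6, 7],
    'B': ['a', 'b', 'c', 'c', 'd', 'd', 'e', 'f', 'g'],
    'Col_1': [np.nan, 'A', 'A', np.nan, 'B', np.nan, 'B', np.nan, np.nan],
    'Col_2': [2, 2, 3, 3, 3, 3, 4, 4, 5],
})

is_dup = df.duplicated(['A', 'B'], keep=False)  # True for every row whose (A, B) pair repeats
has_nan = df['Col_1'].isna()                    # True where Col_1 is missing (assumption: only Col_1 matters)

out = df[~(is_dup & has_nan)]                   # drop rows that are both duplicated and NaN
print(out)
```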

Or you can just use first() after grouping. GroupBy.first gives back the first non-null value of each column within every group, so the order of the original input does not really matter.
df.groupby(['A','B']).first()
Out[180]:
Col_1 Col_2
A B
1 a NaN 2
2 b A 2
3 c A 3
4 d B 3
5 e B 4
6 f NaN 4
7 g NaN 5
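If A and B should stay ordinary columns rather than become the index, passing as_index=False to groupby is one option. A short sketch follows. Bear in mind that first() picks the first non-null value per column independently, so with several value columns a result row can combine values from different input rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 3, 4, 4, 5, 6, 7],
    'B': ['a', 'b', 'c', 'c', 'd', 'd', 'e', 'f', 'g'],
    'Col_1': [np.nan, 'A', 'A', np.nan, 'B', np.nan, 'B', np.nan, np.nan],
    'Col_2': [2, 2, 3, 3, 3, 3, 4, 4, 5],
})

# as_index=False keeps the grouping keys as regular columns
out = df.groupby(['A', 'B'], as_index=False).first()
print(out)
```

Unlike the drop_duplicates approaches, this produces a fresh 0..n-1 index instead of keeping the original row labels.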

Related

replace values by condition after group by

So I have a dataframe like the one below.
dff = pd.DataFrame({'id':[1,1,1,1,1,2,2,2,2,2,3,3,3,3,3], 'categ':['A','A','A','B','C','A','A','A','B','C','A','A','A','B','C'],'cost':[3,1,1,3,10,1,2,3,4,10,2,2,2,4,13] })
dff
id categ cost
0 1 A 3
1 1 A 1
2 1 A 1
3 1 B 3
4 1 C 10
5 2 A 1
6 2 A 2
7 2 A 3
8 2 B 4
9 2 C 10
10 3 A 2
11 3 A 2
12 3 A 2
13 3 B 4
14 3 C 13
Now I want to group by 'id' and create a new column that is True if the summed cost of category A equals 50% of the cost of C and the summed cost of B equals 30% of the cost of C, and False otherwise. My desired output is the one below.
new
id
1 True
2 False
3 False
I have tried a few things but I can't make it work. Any idea how to get my desired output? Thanks
Try pivoting the data frame first and then check whether columns A, B and C satisfy the condition:
import numpy as np
dff.pivot_table('cost', 'id', 'categ', aggfunc='sum')\
.assign(new = lambda df: np.isclose(df.A, 0.5 * df.C) & np.isclose(df.B, 0.3 * df.C))
categ A B C new
id
1 5 3 10 True
2 6 4 10 False
3 6 4 13 False
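Try pd.crosstab with normalize, then apply a little math. When A = 0.5 * C and B = 0.3 * C, the row shares of A, B and C are 0.5/1.8, 0.3/1.8 and 1/1.8.
Note: we cannot test for exact equality here because the values are floats; we need np.isclose.
s = pd.crosstab(dff['id'], dff['categ'], dff['cost'], aggfunc='sum', normalize='index')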
s['new'] = np.isclose(s.values.tolist(),[0.5/1.8,0.3/1.8,1/1.8],atol=0.0001).all(1)
s
Out[341]:
categ A B C new
id
1 0.277778 0.166667 0.555556 True
2 0.300000 0.200000 0.500000 False
3 0.260870 0.173913 0.565217 False
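Both answers rely on np.isclose rather than == because fractions such as 0.3 are not exactly representable in binary floating point. The sketch below first shows that difference on plain numbers, then reduces the pivot approach to the single-column result asked for. The names wide and new are illustrative.

```python
import numpy as np
import pandas as pd

# Exact comparison of floats can fail where a tolerance-based one succeeds
print(0.1 + 0.2 == 0.3)            # False
print(np.isclose(0.1 + 0.2, 0.3))  # True

dff = pd.DataFrame({
    'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    'categ': ['A', 'A', 'A', 'B', 'C'] * 3,
    'cost': [3, 1, 1, 3, 10, 1, 2, 3, 4, 10, 2, 2, 2, 4, 13],
})

# One row per id, one column per category, summed cost in each cell
wide = dff.pivot_table(values='cost', index='id', columns='categ', aggfunc='sum')

# True only where A is 50% of C and B is 30% of C (within floating-point tolerance)
flags = np.isclose(wide['A'], 0.5 * wide['C']) & np.isclose(wide['B'], 0.3 * wide['C'])
new = pd.DataFrame({'new': flags}, index=wide.index)
print(new)
```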

Replace repetitive number with NAN values except the first, in pandas column

I have a data frame like this,
df
col1 col2
1 A
2 A
3 B
4 C
5 C
6 C
7 B
8 B
9 A
Now we can see that A, B and C occur in consecutive runs. I want to keep the value only in the row where each run starts; the other values of the same run should become NaN.
The final data frame I am looking for looks like this:
df
col1 col2
1 A
2 NA
3 B
4 C
5 NA
6 NA
7 B
8 NA
9 A
I can do it using a for loop and comparisons, but the execution time will be long. I am looking for a Pythonic way to do it, maybe some pandas shortcut.
Compare each value with the previous one via Series.shift, then set the repeated values to missing with Series.where or numpy.where:
df['col2'] = df['col2'].where(df['col2'].ne(df['col2'].shift()))
#alternative
#df['col2'] = np.where(df['col2'].ne(df['col2'].shift()), df['col2'], np.nan)
Or use DataFrame.loc with the condition inverted by ~:
df.loc[~df['col2'].ne(df['col2'].shift()), 'col2'] = np.nan
Or, thanks to @Daniel Mesejo, use eq for ==:
df.loc[df['col2'].eq(df['col2'].shift()), 'col2'] = np.nan
print (df)
col1 col2
0 1 A
1 2 NaN
2 3 B
3 4 C
4 5 NaN
5 6 NaN
6 7 B
7 8 NaN
8 9 A
Detail:
print (df['col2'].ne(df['col2'].shift()))
0 True
1 False
2 True
3 True
4 False
5 False
6 True
7 False
8 True
Name: col2, dtype: bool
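For anyone who wants to run this end to end, here is a self-contained sketch of the Series.where variant with the question's data rebuilt by hand. The name starts is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': range(1, 10),
                   'col2': list('AABCCCBBA')})

# True where the value differs from the row above, i.e. where a run starts.
# The first row is compared against NaN from shift(), so it is always True.
starts = df['col2'].ne(df['col2'].shift())

# Keep the value at run starts, replace the rest with NaN
df['col2'] = df['col2'].where(starts)
print(df)
```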

How to separate null and non-null containing rows into two different DataFrames?

Say I have a big DataFrame (>10000 rows) in which some rows contain one or more nulls. How do I remove all the rows containing a null in one or more columns from the original DataFrame and put those rows into another DataFrame?
e.g.:
Original DataFrame:
a b c
1 "foo" 5 3
2 "bar" 9 1
3 NaN 5 4
4 "foo" NaN 1
Non-Null DataFrame:
a b c
1 "foo" 5 3
2 "bar" 9 1
Null containing DataFrame:
a b c
1 NaN 5 4
2 "foo" NaN 1
Use DataFrame.isna for checking missing values:
print (df.isna())
#print (df.isnull())
a b c
1 False False False
2 False False False
3 True False False
4 False True False
And test whether each row contains at least one True with DataFrame.any:
mask = df.isna().any(axis=1)
#older pandas versions
mask = df.isnull().any(axis=1)
print (mask)
1 False
2 False
3 True
4 True
dtype: bool
Finally, filter by boolean indexing; ~ inverts the boolean mask:
df1 = df[~mask]
df2 = df[mask]
print (df1)
a b c
1 foo 5.0 3
2 bar 9.0 1
print (df2)
a b c
3 NaN 5.0 4
4 foo NaN 1
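Put together as a runnable sketch, with the question's example frame rebuilt by hand (the names non_null and with_null are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar', np.nan, 'foo'],
                   'b': [5, 9, 5, np.nan],
                   'c': [3, 1, 4, 1]},
                  index=[1, 2, 3, 4])

mask = df.isna().any(axis=1)  # True for rows with at least one missing value

non_null = df[~mask]          # rows without any missing value
with_null = df[mask]          # rows with one or more missing values
print(non_null)
print(with_null)
```

The non-null half is the same frame that df.dropna() returns; the mask is what makes it possible to collect the other half as well.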

Pandas - remove row similar to other row

I need to remove all rows from a pandas.DataFrame that satisfy an unusual condition.
If a row has NaN in column "C" and there is another row that is otherwise exactly the same, I want to remove the NaN row.
Given a table:
A B C D
1 2 NaN 3
1 2 50 3
10 20 NaN 30
5 6 7 8
I need to remove the first row, since it has NaN in column C and there is an otherwise identical row (the second) with a real value in column C.
However, the 3rd row must stay, because no other row has the same A, B and D values.
How do you perform this using pandas? Thank you!
You can achieve this using drop_duplicates.
Initial DataFrame:
df=pd.DataFrame(columns=['a','b','c','d'], data=[[1,2,None,3],[1,2,50,3],[10,20,None,30],[5,6,7,8]])
df
a b c d
0 1 2 NaN 3
1 1 2 50 3
2 10 20 NaN 30
3 5 6 7 8
Then you can sort the DataFrame by column c. This moves the NaNs to the bottom of the column:
df = df.sort_values(['c'])
df
a b c d
3 5 6 7 8
1 1 2 50 3
0 1 2 NaN 3
2 10 20 NaN 30
Then remove duplicates, comparing every column except c and keeping the first row found:
df1 = df.drop_duplicates(['a','b','d'], keep='first')
a b c d
3 5 6 7 8
1 1 2 50 3
2 10 20 NaN 30
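But this is valid only if the NaNs are in column c.
You can try filling the NaNs along with drop_duplicates. Fill within each (A, B, D) group only; a plain df.bfill().ffill() would also copy the 7 from the last row into the unrelated third row:
filled = df.assign(C=df.groupby(['A', 'B', 'D'])['C'].transform(lambda s: s.bfill().ffill()))
filled.drop_duplicates(subset=['A', 'B', 'D'], keep='last')
Because duplicates are judged on A, B and D only, this also collapses rows whose A, B and D values are the same when C is non-NaN in both rows (the last one is kept).
You get
A B C D
1 1 2 50.0 3
2 10 20 NaN 30
3 5 6 7.0 8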
This feels right to me
notdups = ~df.duplicated(df.columns.difference(['C']), keep=False)
notnans = df.C.notnull()
df[notdups | notnans]
A B C D
1 1 2 50.0 3
2 10 20 NaN 30
3 5 6 7.0 8
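A self-contained version of this last approach, for reference. The frame is rebuilt from the question's table, and key_cols is an illustrative name for "every column except C".

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 10, 5],
                   'B': [2, 2, 20, 6],
                   'C': [np.nan, 50, np.nan, 7],
                   'D': [3, 3, 30, 8]})

key_cols = df.columns.difference(['C']).tolist()  # ['A', 'B', 'D']

notdups = ~df.duplicated(key_cols, keep=False)    # rows whose key columns are unique
notnans = df['C'].notna()                         # rows with a real value in C

# Keep a row if it has no twin on the key columns, or if its C is not NaN
out = df[notdups | notnans]
print(out)
```

No sorting is involved, so the original row order and index labels are preserved.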

Combine data from two columns into one, except if second is already occupied in pandas

Say I have two columns in a data frame, one of which is incomplete.
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b':[5, '', 6, '']})
df
Out:
a b
0 1 5
1 2
2 3 6
3 4
Is there a way to fill the empty values in column b with the corresponding values in column a, leaving the rest of column b intact and without iterating over the column, so that you obtain:
df
Out:
a b
0 1 5
1 2 2
2 3 6
3 4 4
I think you can use the apply method, but I am not sure. For reference, the dataset I'm dealing with is quite large (approx. 1 GB), which is why iteration, my first attempt, was not a good idea.
If the blanks are empty strings, you could do:
In [165]: df.loc[df['b'] == '', 'b'] = df['a']
In [166]: df
Out[166]:
a b
0 1 5
1 2 2
2 3 6
3 4 4
However, if your blanks are NaNs, you could use fillna:
In [176]: df
Out[176]:
a b
0 1 5.0
1 2 NaN
2 3 6.0
3 4 NaN
In [177]: df['b'] = df['b'].fillna(df['a'])
In [178]: df
Out[178]:
a b
0 1 5.0
1 2 2.0
2 3 6.0
3 4 4.0
You can use np.where to evaluate df.b: if a value is not empty, keep it; otherwise use df.a instead.
df.b=np.where(df.b,df.b,df.a)
df
Out[33]:
a b
0 1 5
1 2 2
2 3 6
3 4 4
You can use pd.Series.where with a boolean version of df.b, because '' resolves to False:
df.assign(b=df.b.where(df.b.astype(bool), df.a))
a b
0 1 5
1 2 2
2 3 6
3 4 4
You can use replace and ffill with axis=1:
df.replace('',np.nan).ffill(axis=1).astype(df.a.dtypes)
Output:
a b
0 1 5
1 2 2
2 3 6
3 4 4
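One more variation, sketched here as an illustration and not taken from the answers above: Series.mask is the mirror image of Series.where and replaces values where the condition is True, which reads naturally for "replace the blanks".

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, '', 6, '']})

# Where b is an empty string, take the value from a; elsewhere keep b
df['b'] = df['b'].mask(df['b'] == '', df['a'])

# Column b started as object dtype because of the strings; cast it to match a
df['b'] = df['b'].astype(df['a'].dtype)
print(df)
```

Like the other answers, this is vectorised, so it avoids iterating over the column row by row.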
