Drop duplicate rows from DataFrame based on conditions on multiple columns - python

I have a dataframe as follows:
id   value  date
001  True   01/01/2022 00:00:00
002  False  03/01/2022 00:00:00
003  True   03/01/2022 00:00:00
001  False  01/01/2022 01:30:00
001  True   01/01/2022 01:30:00
002  True   03/01/2022 00:00:00
003  True   03/01/2022 00:30:00
004  False  03/01/2022 00:30:00
005  False  01/01/2022 00:00:00
There are some duplicate rows in the raw dataframe, and I would like to remove them based on the following conditions:
If there are duplicate ids with the same date and time, select the row with value True (e.g., id == 002)
If there are duplicate ids with the same value, select the row with the latest date and time (e.g., id == 003)
If there are duplicate ids, select the row with the latest date and time and, among those, the row with value True (e.g., id == 001)
Expected output:
id   value  date
001  True   01/01/2022 01:30:00
002  True   03/01/2022 00:00:00
003  True   03/01/2022 00:30:00
004  False  03/01/2022 00:30:00
005  False  01/01/2022 00:00:00
Can somebody suggest how to drop duplicates from the dataframe based on the conditions above?
Thanks.

It looks like perhaps you just need to sort your dataframe prior to dropping duplicates. Something like this:
output = (
    df.sort_values(by=['date', 'value'], ascending=False)
      .drop_duplicates(subset='id')
      .sort_values(by='id')
)
print(output)
Output
id value date
4 1 True 2022-01-01 01:30:00
5 2 True 2022-03-01 00:00:00
6 3 True 2022-03-01 00:30:00
7 4 False 2022-03-01 00:30:00
8 5 False 2022-01-01 00:00:00
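For reference, here is a self-contained sketch of the same sort-then-drop approach on the sample data from the question. It assumes the dates are month/day/year (which matches the output above) and keeps id as a string; the final reset_index is an optional addition that is not part of the answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id": ["001", "002", "003", "001", "001", "002", "003", "004", "005"],
    "value": [True, False, True, False, True, True, True, False, False],
    "date": [
        "01/01/2022 00:00:00", "03/01/2022 00:00:00", "03/01/2022 00:00:00",
        "01/01/2022 01:30:00", "01/01/2022 01:30:00", "03/01/2022 00:00:00",
        "03/01/2022 00:30:00", "03/01/2022 00:30:00", "01/01/2022 00:00:00",
    ],
})

# Assumption: month/day/year. Parse to datetime so sorting is chronological,
# not alphabetical.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y %H:%M:%S")

output = (
    df.sort_values(by=["date", "value"], ascending=False)  # latest first, True before False
      .drop_duplicates(subset="id")                        # keep the first row per id
      .sort_values(by="id")
      .reset_index(drop=True)
)
print(output)
```

Because drop_duplicates keeps the first row per id by default, the descending sort decides which row survives: the latest timestamp wins, and True wins a tie on the timestamp.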

Related

Why are there different results for pandas groupby+resample on an appended dataframe

I want to group and resample a dataframe. I group by int_var and bool_var, and then resample per 1Min to fill in any missing minutes in the dataset. This works perfectly fine for the base dataframe A:
date bool_var int_var
2021-01-01 00:03:00 True 1
2021-01-01 00:06:00 False 6
2021-01-01 00:06:00 True 6
The result then becomes something like this:
int_var bool_var date
1 True 2021-01-01 00:03:00 1
2021-01-01 00:04:00 0
2021-01-01 00:05:00 0
2021-01-01 00:06:00 0
6 True 2021-01-01 00:03:00 0
2021-01-01 00:04:00 0
2021-01-01 00:05:00 0
2021-01-01 00:06:00 1
6 False 2021-01-01 00:03:00 0
2021-01-01 00:04:00 0
2021-01-01 00:05:00 0
2021-01-01 00:06:00 1
This is exactly what I want. However, as you can see the data starts a bit after midnight, and I want those minutes from midnight to be in there as well. So I append a row for each bool_var / int_var combination at 2021-01-01 00:00:00, to make sure the resampling starts from there.
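rows = []
# Assumed loop: one midnight row per existing bool_var / int_var combination
for bool_var, int_var in A[['bool_var', 'int_var']].drop_duplicates().itertuples(index=False):
    rows.append([pd.Timestamp('2021-01-01 00:00:00'), bool_var, int_var])
extra_rows_df = pd.DataFrame(rows, columns=['date', 'bool_var', 'int_var'])
B = pd.concat([A, extra_rows_df], ignore_index=True)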
The resulting dataframe B appears to be correct, and in the same format as dataframe A:
date bool_var int_var
2021-01-01 00:00:00 True 1
2021-01-01 00:03:00 True 1
2021-01-01 00:00:00 False 6
2021-01-01 00:06:00 False 6
2021-01-01 00:00:00 True 6
2021-01-01 00:06:00 True 6
However, if I run the exact same groupby and resample command on dataframe B, the results look completely different:
date 2021-01-01 00:00:00 ... 2021-12-31 23:59:00
int_var bool_var 1 ... 1
1 True
6 True
False
It is like each date suddenly became a column instead of being listed for each grouping.
TL;DR: use stack().
I figured it out. In dataframe A, every bool_var / int_var group has different datetime values; here (1, True) started with 00:03, but some other group, e.g. (2, True) could start with an entry at 01:14. Once I filled out dataframe A so that each group had an entry at 00:00 in dataframe B, and I resampled to fill in each minute, every group had each datetime. In this way, all those datetimes could become columns since they apply to each group.
The solution is to use stack() on this final result, which moves the datetime columns back into the row index.
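A minimal sketch of what stack() does to a result shaped like the one above. The wide frame is built by hand here (the values are made up for illustration) rather than produced by the original groupby and resample call, which is not shown in the question.

```python
import pandas as pd

# Hypothetical wide result: one row per (int_var, bool_var) group,
# one column per minute.
minutes = pd.date_range("2021-01-01 00:00", periods=3, freq="min")
groups = pd.MultiIndex.from_tuples(
    [(1, True), (6, True), (6, False)], names=["int_var", "bool_var"]
)
wide = pd.DataFrame(
    [[1, 0, 0], [0, 1, 0], [0, 0, 1]], index=groups, columns=minutes
)
wide.columns.name = "date"

# stack() pivots the column labels into the innermost index level,
# giving one value per (int_var, bool_var, date).
long = wide.stack()
print(long)
```

The result is a Series indexed by int_var, bool_var and date, which is the long layout that dataframe A produced.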

How To Find A Closest Date To a Given Date?

I have two dataframes, df1 and df2. For each row in df1, I need to find the date in df2 that has the same ID and is closest to the df1 date. The desired output df below illustrates this.
df1:
ID Date
1 2020-01-01
2 2020-01-03
df2:
ID Date
11 2020-01-11
4 2020-02-03
5 2020-04-02
6 2020-01-05
1 2021-01-13
1 2021-03-03
1 2020-01-30
2 2020-03-31
2 2021-04-01
2 2021-02-02
df (Output desired)
ID Date Closest Date
1 2020-01-01 2020-01-30
2 2020-01-03 2020-03-31
Here's one way to achieve it, assuming that the Date columns' dtype is datetime. First,
df3 = df1[df1.ID.isin(df2.ID)].copy()  # .copy() avoids SettingWithCopyWarning when adding a column below
will give you
ID Date
0 1 2020-01-01
1 2 2020-01-03
Then
df3['Closest_date'] = df3.apply(
    lambda row: min(df2[df2.ID.eq(row.ID)].Date,
                    key=lambda x: abs(x - row.Date)),
    axis=1)
takes the min of df2.Date for each row, where
df2[df2.ID.eq(row.ID)].Date selects the dates of the rows with the matching ID, and
key=lambda x: abs(x - row.Date) tells min to compare by distance from the row's date.
The function has to be applied per row, hence axis=1.
Output:
ID Date Closest_date
0 1 2020-01-01 2020-01-30
1 2 2020-01-03 2020-03-31
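As an alternative to the row-wise apply, pandas.merge_asof can do the nearest-date match per ID in one call. This is a sketch of a different technique from the answer above, not a rewrite of it; the intermediate name right is introduced only for illustration, and both frames must be sorted by their date keys.

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": [1, 2],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-03"]),
})
df2 = pd.DataFrame({
    "ID": [11, 4, 5, 6, 1, 1, 1, 2, 2, 2],
    "Date": pd.to_datetime([
        "2020-01-11", "2020-02-03", "2020-04-02", "2020-01-05", "2021-01-13",
        "2021-03-03", "2020-01-30", "2020-03-31", "2021-04-01", "2021-02-02",
    ]),
})

# merge_asof requires both sides to be sorted on their date keys.
right = df2.rename(columns={"Date": "Closest_date"}).sort_values("Closest_date")

result = pd.merge_asof(
    df1.sort_values("Date"),
    right,
    left_on="Date",
    right_on="Closest_date",
    by="ID",              # only match rows with the same ID
    direction="nearest",  # closest date, whether earlier or later
)
print(result)
```

This avoids calling a Python function per row, which may matter on larger frames.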

Fill pandas column using values from list

This is my list:
my_list = [
    '2002-01-11 22:15:00',
    '2002-02-12 10:30:00',
    '2002-03-14 02:30:00',
    '2002-04-12 22:15:00'
]
I have DataFrame:
dt_object diff
0 2002-01-01 00:00:00 -160.95041
1 2002-01-01 00:15:00 -160.81016
2 2002-01-01 00:30:00 -160.66989
3 2002-01-01 00:45:00 -160.52961
4 2002-01-01 01:00:00 -160.38930
I want to create a new column 'hit' that is False by default and True where the date matches one from the list.
Expected output:
dt_object diff hit
0 2002-01-01 00:00:00 -160.95041 False
1 2002-01-01 00:15:00 -160.81016 False
2 2002-01-01 00:30:00 -160.66989 False
3 2002-01-01 00:45:00 -160.52961 False
4 2002-01-01 01:00:00 -160.38930 False
....................
....................
1010 2002-01-11 22:15:00 -150.54678 True
because 2002-01-11 22:15:00 is in list.
You can do:
import numpy as np
df['hit'] = np.where(df['dt_object'].isin(my_list), 1, 0)  # gives 1 or 0 depending on whether the condition is satisfied
To get back True or False instead, drop the np.where wrapper:
df['hit'] = df['dt_object'].isin(my_list)
Use Series.isin:
df['hit'] = df['dt_object'].isin(my_list)
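A self-contained sketch of the isin approach, assuming dt_object is a datetime column and the list holds timestamp strings. Converting the list with pd.to_datetime first keeps both sides the same dtype; the name hits and the three-row frame are illustrative only.

```python
import pandas as pd

my_list = [
    "2002-01-11 22:15:00",
    "2002-02-12 10:30:00",
    "2002-03-14 02:30:00",
    "2002-04-12 22:15:00",
]

# Small sample of the frame from the question
df = pd.DataFrame({
    "dt_object": pd.to_datetime([
        "2002-01-01 00:00:00", "2002-01-01 00:15:00", "2002-01-11 22:15:00",
    ]),
    "diff": [-160.95041, -160.81016, -150.54678],
})

# Convert the strings to datetimes so the comparison is datetime-to-datetime
hits = pd.to_datetime(my_list)
df["hit"] = df["dt_object"].isin(hits)
print(df)
```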

Delete all (hourly) day entries per row based on a daily table in python

I have a dataframe with a datetime64[ns] column in the following format, holding hourly data:
Datum Values
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-02-28 00:00:00 5
2020-03-01 00:00:00 4
and another table with closing days, also a datetime64[ns] column, which holds dates only:
Dates
2020-02-28
2020-02-29
....
How can I delete all days in the first dataframe df that occur in the second dataframe Dates? So that df is:
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-03-01 00:00:00 4
Use Series.dt.floor to set the times to midnight, so the days can be matched with Series.isin; then invert the mask and filter with boolean indexing:
df['Datum'] = pd.to_datetime(df['Datum'])
df1['Dates'] = pd.to_datetime(df1['Dates'])
df = df[~df['Datum'].dt.floor('D').isin(df1['Dates'])]
print (df)
Datum Values
0 2020-01-01 00:00:00 1
1 2020-01-01 01:00:00 10
3 2020-03-01 00:00:00 4
EDIT: For a flag column, convert the mask to integers with Series.astype (Series.view is deprecated in recent pandas versions):
df['flag'] = df['Datum'].dt.floor('D').isin(df1['Dates']).astype('int')
print (df)
Datum Values flag
0 2020-01-01 00:00:00 1 0
1 2020-01-01 01:00:00 10 0
2 2020-02-28 00:00:00 5 1
3 2020-03-01 00:00:00 4 0
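The two steps above, put together as a runnable sketch on the sample rows from the question. The names closed, kept and flagged are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Datum": pd.to_datetime([
        "2020-01-01 00:00:00", "2020-01-01 01:00:00",
        "2020-02-28 00:00:00", "2020-03-01 00:00:00",
    ]),
    "Values": [1, 10, 5, 4],
})
df1 = pd.DataFrame({"Dates": pd.to_datetime(["2020-02-28", "2020-02-29"])})

# True where the row's day appears in the closing-day table
closed = df["Datum"].dt.floor("D").isin(df1["Dates"])

kept = df[~closed]                            # rows on open days only
flagged = df.assign(flag=closed.astype(int))  # or keep all rows and flag them
print(kept)
print(flagged)
```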
Taking your added comment into consideration:
Build a regex alternation string from the Dates in df1 (this assumes they are stored as strings)
c="|".join(df1.Dates.values)
c
Coerce Datum to datetime
df['Datum']=pd.to_datetime(df['Datum'])
df.dtypes
Extract Datum as dates, dtype string
df.set_index(df['Datum'],inplace=True)
df['Dates']=df.index.date.astype(str)
Boolean-select the dates that occur in both
m=df.Dates.str.contains(c)
m
Mark dates that occur in df1 as 0 and all others as 1
df['drop']=np.where(m,0,1)
df
Drop unwanted rows
df[df['drop'].eq(1)].reset_index(drop=True).drop(columns=['Dates'])

how can i delete whole day rows on condition column values.. pandas

I have the time series dataframe below.
I want to delete rows based on a per-day condition: if aaa > 100 anywhere in a day, delete all rows for that day (in the data below, delete all 2015-12-01 rows because the last 3 values of the aaa column are 1000).
....
date time aaa
2015-12-01,00:00:00,0
2015-12-01,00:15:00,0
2015-12-01,00:30:00,0
2015-12-01,00:45:00,0
2015-12-01,01:00:00,0
2015-12-01,01:15:00,0
2015-12-01,01:30:00,0
2015-12-01,01:45:00,0
2015-12-01,02:00:00,0
2015-12-01,02:15:00,0
2015-12-01,02:30:00,0
2015-12-01,02:45:00,0
2015-12-01,03:00:00,0
2015-12-01,03:15:00,0
2015-12-01,03:30:00,0
2015-12-01,03:45:00,0
2015-12-01,04:00:00,0
2015-12-01,04:15:00,0
2015-12-01,04:30:00,0
2015-12-01,04:45:00,0
2015-12-01,05:00:00,0
2015-12-01,05:15:00,0
2015-12-01,05:30:00,0
2015-12-01,05:45:00,0
2015-12-01,06:00:00,0
2015-12-01,06:15:00,0
2015-12-01,06:30:00,1000
2015-12-01,06:45:00,1000
2015-12-01,07:00:00,1000
....
How can I do it?
If you have a MultiIndex, first compare the values of aaa against the condition, use that mask with boolean indexing to collect the matching values of the first index level, and finally filter with isin, inverting the condition with ~:
print (df)
aaa
date time
2015-12-01 00:00:00 0
00:15:00 0
00:30:00 0
00:45:00 0
2015-12-02 05:00:00 0
05:15:00 200
05:30:00 0
05:45:00 0
2015-12-03 06:00:00 0
06:15:00 0
06:30:00 1000
06:45:00 1000
07:00:00 1000
lvl0 = df.index.get_level_values(0)
idx = lvl0[df['aaa'].gt(100)].unique()
print (idx)
Index(['2015-12-02', '2015-12-03'], dtype='object', name='date')
df = df[~lvl0.isin(idx)]
print (df)
aaa
date time
2015-12-01 00:00:00 0
00:15:00 0
00:30:00 0
00:45:00 0
And if the first column is not part of the index, compare the date column instead:
print (df)
date time aaa
0 2015-12-01 00:00:00 0
1 2015-12-01 00:15:00 0
2 2015-12-01 00:30:00 0
3 2015-12-01 00:45:00 0
4 2015-12-02 05:00:00 0
5 2015-12-02 05:15:00 200
6 2015-12-02 05:30:00 0
7 2015-12-02 05:45:00 0
8 2015-12-03 06:00:00 0
9 2015-12-03 06:15:00 0
10 2015-12-03 06:30:00 1000
11 2015-12-03 06:45:00 1000
12 2015-12-03 07:00:00 1000
idx = df.loc[df['aaa'].gt(100), 'date'].unique()
print (idx)
['2015-12-02' '2015-12-03']
df = df[~df['date'].isin(idx)]
print (df)
date time aaa
0 2015-12-01 00:00:00 0
1 2015-12-01 00:15:00 0
2 2015-12-01 00:30:00 0
3 2015-12-01 00:45:00 0
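An alternative sketch for the column case that skips the intermediate idx: group the boolean mask by date and broadcast "any value above 100 on this day" back to every row with transform. This is a different technique from the isin filter above, shown on the same sample data; day_exceeds and filtered are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2015-12-01"] * 4 + ["2015-12-02"] * 4 + ["2015-12-03"] * 5,
    "time": [
        "00:00:00", "00:15:00", "00:30:00", "00:45:00",
        "05:00:00", "05:15:00", "05:30:00", "05:45:00",
        "06:00:00", "06:15:00", "06:30:00", "06:45:00", "07:00:00",
    ],
    "aaa": [0, 0, 0, 0, 0, 200, 0, 0, 0, 0, 1000, 1000, 1000],
})

# True for every row of a day on which at least one aaa value exceeds 100
day_exceeds = df["aaa"].gt(100).groupby(df["date"]).transform("any")

filtered = df[~day_exceeds]
print(filtered)
```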
