Pandas - remove rows based on two conditions - python

I have a pandas dataframe like this:
ColA    ColB                 ColC
Apple   2019-03-02 18:00:00  Saturday
Orange  2019-03-03 10:00:00  Sunday
Mango   2019-03-04 09:00:00  Monday
I am trying to remove rows from my dataframe based on certain conditions:
Remove a row if its time falls between 9 AM and 5 PM, inclusive.
Do not remove the row if it falls on a weekend (Saturday or Sunday).
The expected output will not have Mango in the dataframe.

It seems harder than I thought:
s1 = df.ColB.dt.hour.between(9, 17)  # True for hours 09:00-17:59
df.loc[~s1 | df.ColC.isin(['Saturday', 'Sunday'])]
     ColA                ColB      ColC
0   Apple 2019-03-02 18:00:00  Saturday
1  Orange 2019-03-03 10:00:00    Sunday
Or using indexer_between_time:
s1 = pd.Index(df.ColB).indexer_between_time('09:00', '17:00')  # positions inside the window
s1 = df.index.isin(s1)
df.loc[~s1 | df.ColC.isin(['Saturday', 'Sunday'])]

To give another alternative, you could write it like this:
cond1 = df.ColB.dt.hour >= 9    # 09:00 or later
cond2 = df.ColB.dt.hour <= 17   # up to 17:59
cond3 = df.ColB.dt.weekday < 5  # Mon-Fri
df = df[~(cond1 & cond2 & cond3)]
Full example:
import pandas as pd

df = pd.DataFrame({
    'ColA': ['Apple', 'Orange', 'Mango'],
    'ColB': pd.to_datetime([
        '2019-03-02 18:00:00',
        '2019-03-03 10:00:00',
        '2019-03-04 09:00:00'
    ]),
    'ColC': ['Saturday', 'Sunday', 'Monday']
})

cond1 = df.ColB.dt.hour >= 9    # 09:00 or later
cond2 = df.ColB.dt.hour <= 17   # up to 17:59
cond3 = df.ColB.dt.weekday < 5  # Mon-Fri

df = df[~(cond1 & cond2 & cond3)]  # the conditions mark the rows to drop, hence ~
print(df)
Returns:
     ColA                ColB      ColC
0   Apple 2019-03-02 18:00:00  Saturday
1  Orange 2019-03-03 10:00:00    Sunday
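As a side note (my own addition, not from the original answers), ColC can be derived from ColB with dt.day_name(), so the weekend check cannot drift out of sync with a stored day-name column:
weekend = df.ColB.dt.day_name().isin(['Saturday', 'Sunday'])
in_window = df.ColB.dt.hour.between(9, 17)
df = df[~(in_window & ~weekend)]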

Related

Pandas Dataframe datetime condition

I have the following dataframe and would like to create a new column based on a condition: the new column should contain 'Night' for the hours between 20:00 and 06:00, 'Morning' for the time between 06:00 and 14:30, and 'Afternoon' for the time between 14:30 and 20:00. What is the best way to formulate and apply such a condition?
import pandas as pd

df = {'A': ['test', '2222', '1111', '3333', '1111'],
      'B': ['aaa', 'aaa', 'bbbb', 'ccccc', 'aaa'],
      'Date': ['15.07.2018 06:23:56', '15.07.2018 01:23:56', '15.07.2018 06:40:06', '15.07.2018 11:38:27', '15.07.2018 21:38:27'],
      'Defect': [0, 1, 0, 1, 0]}
df = pd.DataFrame(df)
df['Date'] = pd.to_datetime(df['Date'])
You can use np.select:
import numpy as np
from datetime import time

condlist = [df['Date'].dt.time.between(time(6), time(14, 30)),
            df['Date'].dt.time.between(time(14, 30), time(20))]
df['Time'] = np.select(condlist, ['Morning', 'Afternoon'], default='Night')
Output:
>>> df
A B Date Defect Time
0 test aaa 2018-07-15 06:23:56 0 Morning
1 2222 aaa 2018-07-15 01:23:56 1 Night
2 1111 bbbb 2018-07-15 06:40:06 0 Morning
3 3333 ccccc 2018-07-15 11:38:27 1 Morning
4 1111 aaa 2018-07-15 21:38:27 0 Night
Note that you don't need an explicit condition for 'Night':
df['Date'].dt.time.between(time(20), time(23, 59, 59)) \
    | df['Date'].dt.time.between(time(0), time(6))
because np.select takes a default value as an argument.
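One boundary subtlety worth noting (a small demo of my own, not part of the original answer): both between calls include 14:30, and np.select takes the first condition that matches, so 14:30 exactly is labelled 'Morning':
import numpy as np
import pandas as pd
from datetime import time

s = pd.Series(pd.to_datetime(['2018-07-15 14:30:00']))
condlist = [s.dt.time.between(time(6), time(14, 30)),   # includes 14:30
            s.dt.time.between(time(14, 30), time(20))]  # also includes 14:30
print(np.select(condlist, ['Morning', 'Afternoon'], default='Night'))  # ['Morning']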
You can create a DatetimeIndex from the date field and then use indexer_between_time:
idx = pd.DatetimeIndex(df["Date"])

conditions = [
    ("20:00", "06:00", "Night"),
    ("06:00", "14:30", "Morning"),
    ("14:30", "20:00", "Afternoon"),
]

for start, end, val in conditions:
    df.loc[idx.indexer_between_time(start, end, include_end=False), "Time_of_Day"] = val
A B Date Defect Time_of_Day
0 test aaa 2018-07-15 06:23:56 0 Morning
1 2222 aaa 2018-07-15 01:23:56 1 Night
2 1111 bbbb 2018-07-15 06:40:06 0 Morning
3 3333 ccccc 2018-07-15 11:38:27 1 Morning
4 1111 aaa 2018-07-15 21:38:27 0 Night
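A detail that makes the loop above work (my own check, under current pandas behaviour): indexer_between_time wraps across midnight when start is later than end, which is how ("20:00", "06:00") captures the Night bucket:
import pandas as pd

idx = pd.DatetimeIndex(["2018-07-15 21:00", "2018-07-16 03:00", "2018-07-16 12:00"])
print(idx.indexer_between_time("20:00", "06:00", include_end=False))  # [0 1]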

How to replace timestamp across the columns using pandas

df = pd.DataFrame({
    'subject_id': [1, 1, 2, 2],
    'time_1': ['2173/04/11 12:35:00', '2173/04/12 12:50:00', '2173/04/11 12:59:00', '2173/04/12 13:14:00'],
    'time_2': ['2173/04/12 16:35:00', '2173/04/13 18:50:00', '2173/04/13 22:59:00', '2173/04/21 17:14:00'],
    'val': [5, 5, 40, 40],
    'iid': [12, 12, 12, 12]
})
df['time_1'] = pd.to_datetime(df['time_1'])
df['time_2'] = pd.to_datetime(df['time_2'])
df['day'] = df['time_1'].dt.day
My dataframe currently looks like the one built above. I would like to replace the timestamp in the time_1 column with 00:00:00 and in the time_2 column with 23:59:00.
This is what I tried, but it doesn't work:
df.groupby(df['subject_id'])['time_1'].apply(lambda x: pd.datetime.strftime(x, "%H:%M:%S") == "00:00:00")  # approach 1
df.groupby(df['subject_id'])['time_1'].apply(lambda x: pd.pd.Timestamp(hour='00', second='00'))  # approach 2
I expect time_1 to end up at midnight and time_2 at 23:59:00, as in the output shown below.
Use Series.dt.floor or Series.dt.normalize to remove the times, and for the second column add a DateOffset. (Note that if all datetimes in a column have 00:00:00 times, pandas does not display the time component.)
df['time_1'] = pd.to_datetime(df['time_1']).dt.floor('d')
# alternative:
# df['time_1'] = pd.to_datetime(df['time_1']).dt.normalize()
df['time_2'] = pd.to_datetime(df['time_2']).dt.floor('d') + pd.DateOffset(hours=23, minutes=59)
df['day'] = df['time_1'].dt.day
print(df)
subject_id time_1 time_2 val iid day
0 1 2173-04-11 2173-04-12 23:59:00 5 12 11
1 1 2173-04-12 2173-04-13 23:59:00 5 12 12
2 2 2173-04-11 2173-04-13 23:59:00 40 12 11
3 2 2173-04-12 2173-04-21 23:59:00 40 12 12
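If you prefer a single timedelta object over pd.DateOffset, a pd.Timedelta works identically here, since the shift is a fixed number of hours and minutes (a small variant of my own):
import pandas as pd

df['time_2'] = pd.to_datetime(df['time_2']).dt.normalize() + pd.Timedelta(hours=23, minutes=59)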

How to trim outliers in dates in python?

I have a dataframe df:
0 2003-01-02
1 2015-10-31
2 2015-11-01
16 2015-11-02
33 2015-11-03
44 2015-11-04
and I want to trim the outliers in the dates. In this example I want to delete the row with the date 2003-01-02. In bigger dataframes I want to delete the dates that do not lie in the interval where 95% or 99% of them lie. Is there a function that can do this?
You could use quantile() on a Series or DataFrame:
import datetime
import pandas as pd

dates = [datetime.date(2003, 1, 2),
         datetime.date(2015, 10, 31),
         datetime.date(2015, 11, 1),
         datetime.date(2015, 11, 2),
         datetime.date(2015, 11, 3),
         datetime.date(2015, 11, 4)]
df = pd.DataFrame({'DATE': [pd.Timestamp(x) for x in dates]})
print(df)

qa = df['DATE'].quantile(0.1)  # lower 10%
qb = df['DATE'].quantile(0.9)  # upper 10%
print(qa, qb)

# remove outliers
xf = df[(df['DATE'] >= qa) & (df['DATE'] <= qb)]
print(xf)
The output is:
DATE
0 2003-01-02
1 2015-10-31
2 2015-11-01
3 2015-11-02
4 2015-11-03
5 2015-11-04
2009-06-01 12:00:00 2015-11-03 12:00:00
DATE
1 2015-10-31
2 2015-11-01
3 2015-11-02
4 2015-11-03
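For the 95% or 99% intervals mentioned in the question, the same idea works with quantiles 0.025/0.975 or 0.005/0.995; a compact one-liner variant (my own sketch) using Series.between:
lo, hi = df['DATE'].quantile(0.005), df['DATE'].quantile(0.995)
xf = df[df['DATE'].between(lo, hi)]  # keep the central 99%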
Assuming you have your column converted to datetime format:
import pandas as pd
import datetime as dt
df = pd.DataFrame(data)  # `data` holds the dates shown above
df = pd.to_datetime(df[0])
you can do:
include = df[df.dt.year > 2003]
print(include)
[out]:
1 2015-10-31
2 2015-11-01
3 2015-11-02
4 2015-11-03
5 2015-11-04
Name: 0, dtype: datetime64[ns]
Building on the answer above (it's basically the same idea), you can derive the cutoff years from quantiles:
s = pd.Series(df)
s10 = s.quantile(.10)
s90 = s.quantile(.90)
my_filtered_data = df[df.dt.year >= s10.year]
my_filtered_data = my_filtered_data[my_filtered_data.dt.year <= s90.year]

How to add a new column by searching for data in a Pandas time series dataframe

I have a Pandas time series dataframe.
It has minute data for a stock for 30 days.
I want to create a new column stating the price of the stock at noon for that day, e.g. for all rows for January 1, I want a new column with the price at noon on January 1; for all rows for January 2, the price at noon on January 2, and so on.
Existing dataframe:
Date     Time   Last_Price   Date     Time   12amT
1/1/19   08:00  100          1/1/19   08:00  ?
1/1/19   08:01  101          1/1/19   08:01  ?
1/1/19   08:02  100.50       1/1/19   08:02  ?
...
31/1/19  21:00  106          31/1/19  21:00  ?
I used this hack, but it is very slow, and I assume there is a quicker and easier way to do this:
for lab, row in df.iterrows():
    t = row["Date"]
    df.loc[lab, "12amT"] = df[(df['Date'] == t) & (df['Time'] == "12:00")]["Last_Price"].values[0]
One way to do this is to use groupby with pd.Grouper.
For pandas 0.24.1+:
df.groupby(pd.Grouper(freq='D'))[0]\
  .transform(lambda x: x.loc[(x.index.hour == 12) &
                             (x.index.minute == 0)].to_numpy()[0])
For older pandas, use .values instead:
df.groupby(pd.Grouper(freq='D'))[0]\
  .transform(lambda x: x.loc[(x.index.hour == 12) &
                             (x.index.minute == 0)].values[0])
MVCE:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(48*60), index=pd.date_range('02-01-2019', periods=48*60, freq='T'))
df['12amT'] = df.groupby(pd.Grouper(freq='D'))[0].transform(lambda x: x.loc[(x.index.hour == 12) & (x.index.minute == 0)].to_numpy()[0])
Output (head):
0 12amT
2019-02-01 00:00:00 0 720
2019-02-01 00:01:00 1 720
2019-02-01 00:02:00 2 720
2019-02-01 00:03:00 3 720
2019-02-01 00:04:00 4 720
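One caveat (my own note): pd.Grouper(freq='D') as used above relies on the frame having a DatetimeIndex. If the timestamps sat in a column instead (the 'ts' and 'price' names below are hypothetical), a day-to-noon-price lookup joined back with .map() is one way to do it:
from datetime import time
import pandas as pd

df2 = df.reset_index().rename(columns={'index': 'ts', 0: 'price'})  # move the index into a column
noon = df2[df2['ts'].dt.time == time(12, 0)]                        # the one noon row per day
noon_map = noon.set_index(noon['ts'].dt.normalize())['price']       # day -> noon price
df2['12amT'] = df2['ts'].dt.normalize().map(noon_map)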
I'm not sure why you have two DateTime columns, so I made my own example to demonstrate:
import numpy as np
import pandas as pd

ind = pd.date_range('1/1/2019', '30/1/2019', freq='H')
df = pd.DataFrame({'Last_Price': np.random.random(len(ind)) + 100}, index=ind)

def noon_price(df):
    noon_price = df.loc[df.index.hour == 12, 'Last_Price'].values
    noon_price = noon_price[0] if len(noon_price) > 0 else np.nan
    df['noon_price'] = noon_price
    return df

df = df.groupby(df.index.day).apply(noon_price).reindex(ind)
Grouping by day fills every row of a day with that day's noon price, and reindex(ind) restores the original DatetimeIndex afterwards.
To add a column with the next day's noon price, you can shift the column up by 24 rows (one day at hourly frequency), like this:
df['T+1'] = df.noon_price.shift(-24)
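Note (my own caveat) that shift(-24) only works because this example has exactly one row per hour with no gaps; for minute bars the equivalent shift is 1440 rows, and a frequency-agnostic alternative shifts the index by a calendar day and realigns:
df['T+1'] = df['noon_price'].shift(-1440)                           # minute data: 1440 rows per day
df['T+1'] = df['noon_price'].shift(-1, freq='D').reindex(df.index)  # or shift by a day regardless of frequency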

how to get the datetimes before and after some specific dates in Pandas?

I have a Pandas DataFrame that looks like
col1
2015-02-02
2015-04-05
2016-07-02
I would like to add, for each date in col1, the x days before and x days after that date.
That means that the resulting DataFrame will contain more rows (specifically, n*(1 + 2*x), where n is the original number of dates in col1).
How can I do that in a proper pandonic way?
Output would be (for x=1):
col1
2015-02-01
2015-02-02
2015-02-03
2015-04-04
etc.
Thanks!
You can do it this way, though I'm not sure it's the best / fastest way:
In [143]: df
Out[143]:
        col1
0 2015-02-02
1 2015-04-05
2 2016-07-02

In [144]:
N = 2

(df.col1.apply(lambda x: pd.Series(pd.date_range(x - pd.Timedelta(days=N),
                                                 x + pd.Timedelta(days=N))))
   .stack()
   .drop_duplicates()
   .reset_index(level=[0, 1], drop=True)
   .to_frame(name='col1')
)
Out[144]:
col1
0 2015-01-31
1 2015-02-01
2 2015-02-02
3 2015-02-03
4 2015-02-04
5 2015-04-03
6 2015-04-04
7 2015-04-05
8 2015-04-06
9 2015-04-07
10 2016-06-30
11 2016-07-01
12 2016-07-02
13 2016-07-03
14 2016-07-04
Alternatively, something like this takes a dataframe with a datetime.date column and stacks shifted copies of that column (one per timedelta offset) underneath the original data:
import datetime
import pandas as pd

df = pd.DataFrame([{'date': datetime.date(2016, 1, 2)}, {'date': datetime.date(2016, 1, 1)}], columns=['date'])
shifts = [datetime.timedelta(days=d) for d in (-1, 0, 1)]
df = pd.concat([df.date + s for s in shifts], ignore_index=True).drop_duplicates().to_frame()
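On pandas 0.25 or newer (an assumption about your version), Series.explode gives a tidier spelling of the same expansion; a sketch:
import pandas as pd

x = 1  # days on either side
s = pd.to_datetime(pd.Series(['2015-02-02', '2015-04-05', '2016-07-02']))
out = (s.apply(lambda d: pd.date_range(d - pd.Timedelta(days=x),
                                       d + pd.Timedelta(days=x)).tolist())
        .explode()
        .drop_duplicates()
        .reset_index(drop=True)
        .to_frame('col1'))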
