Keep the conditional calculation result in a dataframe - python

My data frame dft2022 is:
Start_Time End_Time
9:55:00 10:55:00
5:41:00 14:55:00
9:01:00 12:55:00
9:02:00 7:55:00
8:55:00 N/A
11:55:00 N/A
I want to add a duration column in this dataframe, the duration = End_Time - Start_Time. I used the following:
dft2022["duration"] = pd.to_datetime(dft2022["End_Time"]) - pd.to_datetime(dft2022["Start_Time"])
However, I only want to keep the "duration" value where End_Time - Start_Time is positive value or doesn't have a N/A in End_Time.

The output that you expect is unclear, bit you could use clip to set the negative deltas to 0:
df['duration'] = (pd.to_datetime(dft2022["End_Time"])
.sub(pd.to_datetime(dft2022["Start_Time"]))
.clip(lower='0')
)
Output:
Start_Time End_Time duration
0 9:55:00 10:55:00 0 days 01:00:00
1 5:41:00 14:55:00 0 days 09:14:00
2 9:01:00 12:55:00 0 days 03:54:00
3 9:02:00 7:55:00 0 days 00:00:00
4 8:55:00 NaN NaT
5 11:55:00 NaN NaT
To filter the date, you can use:
df[pd.to_datetime(dft2022["End_Time"])
.sub(pd.to_datetime(dft2022["Start_Time"]))
.gt('0')]
Output:
Start_Time End_Time
0 9:55:00 10:55:00
1 5:41:00 14:55:00
2 9:01:00 12:55:00

Can try something like this, needs Numpy.
import numpy as np
import pandas as pd
data = {'Start_Date': ['9:55:00', '5:41:00', '9:01:00', '9:02:00', '8:55:00', '11:55:00'], 'End_Date': ['10:55:00', '14:55:00', '12:55:00', '7:55:00', 'N/A', 'N/A']}
df = pd.DataFrame(data)
df['_end_time'] = pd.to_datetime(df.End_Date, errors='coerce')
df['_start_time'] = pd.to_datetime(df.Start_Date, errors='coerce')
# Coerce sets NaT if unable to parse as datetime.
df["duration"] = np.where(
df._end_time >= df._start_time,
df._end_time - df._start_time,
pd.NaT
)
df.drop(columns=['_start_time', '_end_time'], inplace=True)
print(df)
Output:
Start_Date End_Date duration
0 9:55:00 10:55:00 3600000000000
1 5:41:00 14:55:00 33240000000000
2 9:01:00 12:55:00 14040000000000
3 9:02:00 7:55:00 NaT
4 8:55:00 N/A NaT
5 11:55:00 N/A NaT

Related

Pandas change time values based on condition

I have a dataframe:
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
I would like to convert the time based on conditions: if the hour is less than 9, I want to set it to 9 and if the hour is more than 17, I need to set it to 17.
I tried this approach:
df['time'] = np.where(((df['time'].dt.hour < 9) & (df['time'].dt.hour != 0)), dt.time(9, 00))
I am getting an error: Can only use .dt. accesor with datetimelike values.
Can anyone please help me with this? Thanks.
Here's a way to do what your question asks:
df.time = pd.to_datetime(df.time)
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
Input:
time
0 2022-06-06 08:45:00
1 2022-06-06 09:30:00
2 2022-06-06 18:00:00
3 2022-06-06 15:00:00
Output:
time
0 2022-06-06 09:45:00
1 2022-06-06 09:30:00
2 2022-06-06 17:00:00
3 2022-06-06 15:00:00
UPDATE:
Here's alternative code to try to address OP's error as described in the comments:
import pandas as pd
import datetime
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print('', 'df loaded as strings:', df, sep='\n')
df.time = pd.to_datetime(df.time, format='%H:%M:%S')
print('', 'df converted to datetime by pd.to_datetime():', df, sep='\n')
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.time = [time.time() for time in pd.to_datetime(df.time)]
print('', 'df with time column adjusted to have hour between 9 and 17, converted to type "time":', df, sep='\n')
Output:
df loaded as strings:
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
df converted to datetime by pd.to_datetime():
time
0 1900-01-01 08:45:00
1 1900-01-01 09:30:00
2 1900-01-01 18:00:00
3 1900-01-01 15:00:00
df with time column adjusted to have hour between 9 and 17, converted to type "time":
time
0 09:45:00
1 09:30:00
2 17:00:00
3 15:00:00
UPDATE #2:
To not just change the hour for out-of-window times, but to simply apply 9:00 and 17:00 as min and max times, respectively (see OP's comment on this), you can do this:
df.loc[df['time'].dt.hour < 9, 'time'] = pd.to_datetime(pd.DataFrame({
'year':df['time'].dt.year, 'month':df['time'].dt.month, 'day':df['time'].dt.day,
'hour':[9]*len(df.index)}))
df.loc[df['time'].dt.hour > 17, 'time'] = pd.to_datetime(pd.DataFrame({
'year':df['time'].dt.year, 'month':df['time'].dt.month, 'day':df['time'].dt.day,
'hour':[17]*len(df.index)}))
df['time'] = [time.time() for time in pd.to_datetime(df['time'])]
Since your 'time' column contains strings they can kept as strings and assign new string values where appropriate. To filter for your criteria it is convenient to: create datetime Series from the 'time' column, create boolean Series by comparing the datetime Series with your criteria, use the boolean Series to filter the rows which need to be changed.
Your data:
import numpy as np
import pandas as pd
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print(df.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
Convert to datetime, make boolean Series with your criteria
dts = pd.to_datetime(df['time'])
lt_nine = dts.dt.hour < 9
gt_seventeen = (dts.dt.hour >= 17)
print(lt_nine)
print(gt_seventeen)
>>>
0 True
1 False
2 False
3 False
Name: time, dtype: bool
0 False
1 False
2 True
3 False
Name: time, dtype: bool
Use the boolean series to assign a new value:
df.loc[lt_nine,'time'] = '09:00:00'
df.loc[gt_seventeen,'time'] = '17:00:00'
print(df.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
Or just stick with strings altogether and create the boolean Series using regex patterns and .str.match.
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00','07:22:00','22:02:06']}
dg = pd.DataFrame(data)
print(dg.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
4 07:22:00
5 22:02:06
# regex patterns
pattern_lt_nine = '^00|01|02|03|04|05|06|07|08'
pattern_gt_seventeen = '^17|18|19|20|21|22|23'
Make boolean Series and assign new values
gt_seventeen = dg['time'].str.match(pattern_gt_seventeen)
lt_nine = dg['time'].str.match(pattern_lt_nine)
dg.loc[lt_nine,'time'] = '09:00:00'
dg.loc[gt_seventeen,'time'] = '17:00:00'
print(dg.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
4 09:00:00
5 17:00:00
Time series / date functionality
Working with text data

Out of different set of dates I want to check if all set of dates are contiguous

if I have 2 different set of dates:
01/05/2022 - 31/12/2022
01/01/2023 - 31/12/2023
01/05/2022 - 30/09/2022
01/10/2022 - 31/12/2022
01/01/2023 - 31/12/2023
I want to check if both set of dates above are contiguous between below range of dates
Date 1 = 01/05/2022
Date 2 = 31/12/2023
Please suggest a solution.
It seems to me easier to use pandas to check if dates fall into the date range.
You have the data day, month, year. In my practice, I usually see the sequences year, month, day.
I changed the variables 'Date_1', 'Date_2' to the desired format and the arrays themselves with dates, which I divided into two parts from and to. Then I filled the dataframe with these arrays and checked the date range. I specifically added one line with data for clarity: 2023-01-01 2025-12-31, it is just filtered, since it does not fall under the condition.
import pandas as pd
from datetime import datetime
Date_1 = '01/05/2022'
Date_2 = '31/12/2023'
Date_1 = datetime.strptime(Date_1, "%d/%m/%Y")
Date_2 = datetime.strptime(Date_2, "%d/%m/%Y")
start = [datetime.strptime(i, "%d/%m/%Y")for i in ['01/05/2022', '01/01/2023', '01/05/2022', '01/10/2022', '01/01/2023', '01/01/2023']]
finish = [datetime.strptime(i, "%d/%m/%Y")for i in ['31/12/2022', '31/12/2023', '30/09/2022', '31/12/2022', '31/12/2023', '31/12/2025']]
df = pd.DataFrame({'start': start, 'finish': finish})
print(df)
print(df[(df['start'] >= Date_1) & (df['finish'] <= Date_2)])
Output print(df)
start finish
0 2022-05-01 2022-12-31
1 2023-01-01 2023-12-31
2 2022-05-01 2022-09-30
3 2022-10-01 2022-12-31
4 2023-01-01 2023-12-31
5 2023-01-01 2025-12-31
Output print(df[(df['start'] >= Date_1) & (df['finish'] <= Date_2)])
start finish
0 2022-05-01 2022-12-31
1 2023-01-01 2023-12-31
2 2022-05-01 2022-09-30
3 2022-10-01 2022-12-31
4 2023-01-01 2023-12-31

First week of year considering the first day last year

I have the following df:
time_series date sales
store_0090_item_85261507 1/2020 1,0
store_0090_item_85261501 2/2020 0,0
store_0090_item_85261500 3/2020 6,0
Being 'date' = Week/Year.
So, I tried use the following code:
df['date'] = df['date'].apply(lambda x: datetime.strptime(x + '/0', "%U/%Y/%w"))
But, return this df:
time_series date sales
store_0090_item_85261507 2020-01-05 1,0
store_0090_item_85261501 2020-01-12 0,0
store_0090_item_85261500 2020-01-19 6,0
But, the first day of the first week of 2020 is 2019-12-29, considering sunday as first day. How can I have the first day 2020-12-29 of the first week of 2020 and not 2020-01-05?
From the datetime module's documentation:
%U: Week number of the year (Sunday as the first day of the week) as a zero padded decimal number. All days in a new year preceding the first Sunday are considered to be in week 0.
Edit: My originals answer doesn't work for input 1/2023 and using ISO 8601 date values doesn't work for 1/2021, so I've edited this answer by adding a custom function
Here is a way with a custom function
import pandas as pd
from datetime import datetime, timedelta
##############################################
# to demonstrate issues with certain dates
print(datetime.strptime('0/2020/0', "%U/%Y/%w")) # 2019-12-29 00:00:00
print(datetime.strptime('1/2020/0', "%U/%Y/%w")) # 2020-01-05 00:00:00
print(datetime.strptime('0/2021/0', "%U/%Y/%w")) # 2020-12-27 00:00:00
print(datetime.strptime('1/2021/0', "%U/%Y/%w")) # 2021-01-03 00:00:00
print(datetime.strptime('0/2023/0', "%U/%Y/%w")) # 2023-01-01 00:00:00
print(datetime.strptime('1/2023/0', "%U/%Y/%w")) # 2023-01-01 00:00:00
#################################################
df = pd.DataFrame({'date':["1/2020", "2/2020", "3/2020", "1/2021", "2/2021", "1/2023", "2/2023"]})
print(df)
def get_first_day(date):
date0 = datetime.strptime('0/' + date.split('/')[1] + '/0', "%U/%Y/%w")
date1 = datetime.strptime('1/' + date.split('/')[1] + '/0', "%U/%Y/%w")
date = datetime.strptime(date + '/0', "%U/%Y/%w")
return date if date0 == date1 else date - timedelta(weeks=1)
df['new_date'] = df['date'].apply(lambda x:get_first_day(x))
print(df)
Input
date
0 1/2020
1 2/2020
2 3/2020
3 1/2021
4 2/2021
5 1/2023
6 2/2023
Output
date new_date
0 1/2020 2019-12-29
1 2/2020 2020-01-05
2 3/2020 2020-01-12
3 1/2021 2020-12-27
4 2/2021 2021-01-03
5 1/2023 2023-01-01
6 2/2023 2023-01-08
You'll want to use ISO week parsing directives, Ex:
import pandas as pd
date = pd.Series(["1/2020", "2/2020", "3/2020"])
pd.to_datetime(date+"/1", format="%V/%G/%u")
0 2019-12-30
1 2020-01-06
2 2020-01-13
dtype: datetime64[ns]
you can also shift by one day if the week should start on Sunday:
pd.to_datetime(date+"/1", format="%V/%G/%u") - pd.Timedelta('1d')
0 2019-12-29
1 2020-01-05
2 2020-01-12
dtype: datetime64[ns]

How to trim outliers in dates in python?

I have a dataframe df:
0 2003-01-02
1 2015-10-31
2 2015-11-01
16 2015-11-02
33 2015-11-03
44 2015-11-04
and I want to trim the outliers in the dates. So in this example I want to delete the row with the date 2003-01-02. Or in bigger data frames I want to delete the dates who do not lie in the interval where 95% or 99% lie. Is there a function who can do this ?
You could use quantile() on Series or DataFrame.
dates = [datetime.date(2003,1,2),
datetime.date(2015,10,31),
datetime.date(2015,11,1),
datetime.date(2015,11,2),
datetime.date(2015,11,3),
datetime.date(2015,11,4)]
df = pd.DataFrame({'DATE': [pd.Timestamp(x) for x in dates]})
print(df)
qa = df['DATE'].quantile(0.1) #lower 10%
qb = df['DATE'].quantile(0.9) #higher 10%
print(qa, qb)
#remove outliers
xf = df[(df['DATE'] >= qa) & (df['DATE'] <= qb)]
print(xf)
The output is:
DATE
0 2003-01-02
1 2015-10-31
2 2015-11-01
3 2015-11-02
4 2015-11-03
5 2015-11-04
2009-06-01 12:00:00 2015-11-03 12:00:00
DATE
1 2015-10-31
2 2015-11-01
3 2015-11-02
4 2015-11-03
Assuming you have your column converted to datetime format:
import pandas as pd
import datetime as dt
df = pd.DataFrame(data)
df = pd.to_datetime(df[0])
you can do:
include = df[df.dt.year > 2003]
print(include)
[out]:
1 2015-10-31
2 2015-11-01
3 2015-11-02
4 2015-11-03
5 2015-11-04
Name: 0, dtype: datetime64[ns]
Have a look here
... regarding to your answer (it's basically the same idea,... be creative my friend):
s = pd.Series(df)
s10 = s.quantile(.10)
s90 = s.quantile(.90)
my_filtered_data = df[df.dt.year >= s10.year]
my_filtered_data = my_filtered_data[my_filtered_data.dt.year <= s90.year]

finding time slots present between start time and end time in python

We have csv file containing predefined time slots.
According to start time and end time provided by the user we want time slots present between the start time and end time.
eg
start time =11:00:00
end time=19:00:00
output- slot_no 2,3,4,5
I think you need boolean indexing with loc and between for selecting column Slot_no, all columns and values are converted to_timedelta, also midnight is replaced to 24:00:00:
df = pd.DataFrame(
{'Slot_no':[1,2,3,4,5,6,7],
'start_time':['0:01:00','8:01:00','10:01:01','12:01:00','14:01:00','18:01:01','20:01:00'],
'end_time':['8:00:00','10:00:00','12:00:00','14:00:00','18:00:00','20:00:00','0:00:00']})
df = df.reindex_axis(['Slot_no','start_time','end_time'], axis=1)
df['start_time'] = pd.to_timedelta(df['start_time'])
df['end_time'] = pd.to_timedelta(df['end_time'].replace('0:00:00', '24:00:00'))
print (df)
Slot_no start_time end_time
0 1 00:01:00 0 days 08:00:00
1 2 08:01:00 0 days 10:00:00
2 3 10:01:01 0 days 12:00:00
3 4 12:01:00 0 days 14:00:00
4 5 14:01:00 0 days 18:00:00
5 6 18:01:01 0 days 20:00:00
6 7 20:01:00 1 days 00:00:00
start = pd.to_timedelta('11:00:00')
end = pd.to_timedelta('19:00:00')
mask = df['start_time'].between(start, end) | df['end_time'].between(start, end)
s = df.loc[mask, 'Slot_no']
print (s)
2 3
3 4
4 5
5 6
Name: Slot_no, dtype: int64
L = df.loc[mask, 'Slot_no'].tolist()
print (L)
[3, 4, 5, 6]

Categories