I have a dataframe with a column 'queue_ist_dt'. This column contains pandas._libs.tslibs.timestamps.Timestamp values. My requirement is to round each time down to the nearest half hour:
if time = 10:13:00 then round_off_time = 10:00:00
if time = 23:29:00 then round_off_time = 23:00:00
and so on.
if time = 10:31:00 then round_off_time = 10:30:00
if time = 23:53:00 then round_off_time = 23:30:00
and so on.
if time = 10:30:00 then round_off_time = 10:30:00
These are the 3 conditions.
I tried to write the following logic :
for r in range(df.shape[0]):
    try:
        if df.loc[r,'queue_ist_dt'].minute<30:
            timedelta = pd.Timedelta(minutes=df.loc[r,'queue_ist_dt'].minute)
            df.loc[r,'queue_placed_interval'] = df.loc[r,'queue_ist_dt']- timedelta
        elif df.loc[r,'queue_ist_dt'].minute>30:
            ******NEED HELP TO BUILD THIS LOGIC******
    except:
        pass
I need help building the logic for times whose minutes are greater than 30, which have to be rounded down to the half hour.
Use Series.dt.floor:
#if necessary convert to datetimes
df['queue_ist_dt'] = pd.to_datetime(df['queue_ist_dt'].astype(str))
df['queue_ist_dt1'] = df['queue_ist_dt'].dt.floor('30Min').dt.time
print (df)
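As a quick check of the floor approach against the cases listed in the question, here is a minimal sketch. The sample times come from the question, the date part is arbitrary, and the output column name queue_placed_interval is borrowed from the asker's loop rather than from this answer.

```python
import pandas as pd

# Sample times from the question (the date part is arbitrary).
df = pd.DataFrame({
    'queue_ist_dt': pd.to_datetime([
        '2021-01-01 10:13:00',
        '2021-01-01 23:29:00',
        '2021-01-01 10:31:00',
        '2021-01-01 23:53:00',
        '2021-01-01 10:30:00',
    ])
})

# floor always rounds down to the start of the 30-minute bucket
df['queue_placed_interval'] = df['queue_ist_dt'].dt.floor('30min')

floored = df['queue_placed_interval'].dt.strftime('%H:%M:%S').tolist()
print(floored)
```

Because the whole column is processed at once, no row loop or try/except is needed.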
The logic is to subtract only the minutes past the half hour from the timestamp.
Code is as below:
for r in range(df.shape[0]):
    try:
        if df.loc[r,'queue_ist_dt'].minute<30:
            timedelta = pd.Timedelta(minutes=df.loc[r,'queue_ist_dt'].minute)
            df.loc[r,'queue_placed_interval'] = df.loc[r,'queue_ist_dt']- timedelta
        elif df.loc[r,'queue_ist_dt'].minute>=30:
            # ****** THIS LOGIC ******
            # subtract only the minutes past the half hour (0 when minute == 30)
            timedelta = pd.Timedelta(minutes=df.loc[r,'queue_ist_dt'].minute-30)
            df.loc[r,'queue_placed_interval'] = df.loc[r,'queue_ist_dt']- timedelta
    except:
        pass
Let me know if this helps you😊
Considering this dataframe df as an example
df = pd.DataFrame({'queue_ist_dt': [pd.Timestamp('2021-01-01 10:01:00'),
                                    pd.Timestamp('2021-01-01 10:35:00'),
                                    pd.Timestamp('2021-01-01 11:19:00'),
                                    pd.Timestamp('2021-01-01 11:33:00'),
                                    pd.Timestamp('2021-01-01 23:23:00'),
                                    pd.Timestamp('2021-01-01 23:22:00'),
                                    pd.Timestamp('2021-01-01 23:55:00')]
                   })
[Out]:
queue_ist_dt
0 2021-01-01 10:01:00
1 2021-01-01 10:35:00
2 2021-01-01 11:19:00
3 2021-01-01 11:33:00
4 2021-01-01 23:23:00
5 2021-01-01 23:22:00
6 2021-01-01 23:55:00
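One way would be to use pandas.Series.dt.round. Note that it rounds to the nearest 30 minutes rather than always rounding down, so 11:19 becomes 11:30 and 23:55 rolls over to midnight of the next day: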
df['round_off_time'] = df['queue_ist_dt'].dt.round('30min')
[Out]:
queue_ist_dt round_off_time
0 2021-01-01 10:01:00 2021-01-01 10:00:00
1 2021-01-01 10:35:00 2021-01-01 10:30:00
2 2021-01-01 11:19:00 2021-01-01 11:30:00
3 2021-01-01 11:33:00 2021-01-01 11:30:00
4 2021-01-01 23:23:00 2021-01-01 23:30:00
5 2021-01-01 23:22:00 2021-01-01 23:30:00
6 2021-01-01 23:55:00 2021-01-02 00:00:00
If the goal is to change the values in the column queue_ist_dt, do the following
df['queue_ist_dt'] = df['queue_ist_dt'].dt.round('30min')
[Out]:
queue_ist_dt
0 2021-01-01 10:00:00
1 2021-01-01 10:30:00
2 2021-01-01 11:30:00
3 2021-01-01 11:30:00
4 2021-01-01 23:30:00
5 2021-01-01 23:30:00
6 2021-01-02 00:00:00
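Since the question asks for times to be rounded down, it may help to compare dt.round with dt.floor side by side. The sketch below uses two of the sample times from this answer; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['2021-01-01 11:19:00', '2021-01-01 23:55:00']))

nearest = s.dt.round('30min')   # nearest half hour, may move forward in time
down = s.dt.floor('30min')      # start of the half hour, never moves forward

comparison = pd.DataFrame({'time': s, 'round': nearest, 'floor': down})
print(comparison)
```

Only the floor column satisfies the rule "23:53 becomes 23:30" from the question.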
I'm trying to measure the difference between timestamps using certain conditions. Using the code below, for each unique ID, I'm hoping to subtract the End Time where Item == A from the Start Time where Item == D.
So the timestamps are actually located on separate rows.
At the moment my process is returning an error. I'm also hoping to drop the .shift() for something more robust as each unique ID will have different combinations. For ex, A,B,C,D - A,B,D - A,D etc.
df = pd.DataFrame({'ID': [10,10,10,20,20,30],
'Start Time': ['2019-08-02 09:00:00','2019-08-03 10:50:00','2019-08-05 16:00:00','2019-08-04 08:00:00','2019-08-04 15:30:00','2019-08-06 11:00:00'],
'End Time': ['2019-08-04 15:00:00','2019-08-04 16:00:00','2019-08-05 16:00:00','2019-08-04 14:00:00','2019-08-05 20:30:00','2019-08-07 10:00:00'],
'Item': ['A','B','D','A','D','A'],
})
df['Start Time'] = pd.to_datetime(df['Start Time'])
df['End Time'] = pd.to_datetime(df['End Time'])
df['diff'] = (df.groupby('ID')
.apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
.reset_index(drop=True))
Intended Output:
ID Start Time End Time Item diff
0 10 2019-08-02 09:00:00 2019-08-04 15:00:00 A NaT
1 10 2019-08-03 10:50:00 2019-08-04 16:00:00 B NaT
2 10 2019-08-05 16:00:00 2019-08-05 16:00:00 D 1 days 01:00:00
3 20 2019-08-04 08:00:00 2019-08-04 14:00:00 A NaT
4 20 2019-08-04 15:30:00 2019-08-05 20:30:00 D 0 days 01:30:00
5 30 2019-08-06 11:00:00 2019-08-07 10:00:00 A NaT
df2 = df.set_index('ID')
df2.query('Item == "D"')['Start Time']-df2.query('Item == "A"')['End Time']
output:
ID
10 1 days 01:00:00
20 0 days 01:30:00
30 NaT
dtype: timedelta64[ns]
older answer
The issue is your fillna, you can't have strings in a timedelta column:
df['diff'] = (df.groupby('ID')
.apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
#.fillna('-') # the issue is here
.reset_index(drop=True))
output:
ID Start Time End Time Item diff
0 10 2019-08-02 09:00:00 2019-08-02 09:30:00 A NaT
1 10 2019-08-03 10:50:00 2019-08-03 11:00:00 B 0 days 00:30:00
2 10 2019-08-04 15:00:00 2019-08-05 16:00:00 C 0 days 00:10:00
3 20 2019-08-04 08:00:00 2019-08-04 14:00:00 B NaT
4 20 2019-08-05 10:30:00 2019-08-05 20:30:00 C 0 days 06:00:00
5 30 2019-08-06 11:00:00 2019-08-07 10:00:00 A NaT
IIUC use:
df1 = df.pivot(index='ID', columns='Item')
print (df1)
Start Time \
Item A B D
ID
10 2019-08-02 09:00:00 2019-08-03 10:50:00 2019-08-04 15:00:00
20 2019-08-04 08:00:00 NaT 2019-08-05 10:30:00
30 2019-08-06 11:00:00 NaT NaT
End Time
Item A B D
ID
10 2019-08-02 09:30:00 2019-08-03 11:00:00 2019-08-05 16:00:00
20 2019-08-04 14:00:00 NaT 2019-08-05 20:30:00
30 2019-08-07 10:00:00 NaT NaT
a = df1[('Start Time','D')].sub(df1[('End Time','A')])
print (a)
ID
10 2 days 05:30:00
20 0 days 20:30:00
30 NaT
dtype: timedelta64[ns]
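To place the difference on the D rows of the original frame, as in the question's intended output, one possible sketch is to look up each ID's A row and subtract it in place. It uses the sample data from the question and assumes at most one A row per ID; the names end_a and is_d are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [10, 10, 10, 20, 20, 30],
    'Start Time': pd.to_datetime([
        '2019-08-02 09:00:00', '2019-08-03 10:50:00', '2019-08-05 16:00:00',
        '2019-08-04 08:00:00', '2019-08-04 15:30:00', '2019-08-06 11:00:00']),
    'End Time': pd.to_datetime([
        '2019-08-04 15:00:00', '2019-08-04 16:00:00', '2019-08-05 16:00:00',
        '2019-08-04 14:00:00', '2019-08-05 20:30:00', '2019-08-07 10:00:00']),
    'Item': ['A', 'B', 'D', 'A', 'D', 'A'],
})

# End Time of the A row for each ID (assumes at most one A row per ID)
end_a = df.loc[df['Item'] == 'A'].set_index('ID')['End Time']

# Start Time minus that ID's A End Time, kept only on the D rows
is_d = df['Item'] == 'D'
df['diff'] = (df['Start Time'] - df['ID'].map(end_a)).where(is_d)
print(df)
```

This avoids .shift(), so it does not depend on how many rows sit between A and D within an ID.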
I'm trying to use pd.cut to divide 24 hours into the following interval:
[6,11),[11,14),[14,17),[17,22),[22,6)
How could I achieve the last bin [22,6)?
Assuming some form of datetime column, try offsetting the datetime by 6 hours so that the lower bound of the first bin becomes midnight. Then cut on those shifted hours instead, with custom labels:
import pandas as pd
# sample data
df = pd.DataFrame({
'datetime': pd.date_range('2021-01-01', periods=24, freq='H')
})
df['bins'] = pd.cut((df['datetime'] - pd.Timedelta(hours=6)).dt.hour,
bins=[0, 5, 8, 11, 16, 24],
labels=['[6,11)', '[11,14)', '[14,17)',
'[17,22)', '[22,6)'],
right=False)
df:
datetime bins
0 2021-01-01 00:00:00 [22,6)
1 2021-01-01 01:00:00 [22,6)
2 2021-01-01 02:00:00 [22,6)
3 2021-01-01 03:00:00 [22,6)
4 2021-01-01 04:00:00 [22,6)
5 2021-01-01 05:00:00 [22,6)
6 2021-01-01 06:00:00 [6,11)
7 2021-01-01 07:00:00 [6,11)
8 2021-01-01 08:00:00 [6,11)
9 2021-01-01 09:00:00 [6,11)
10 2021-01-01 10:00:00 [6,11)
11 2021-01-01 11:00:00 [11,14)
12 2021-01-01 12:00:00 [11,14)
13 2021-01-01 13:00:00 [11,14)
14 2021-01-01 14:00:00 [14,17)
15 2021-01-01 15:00:00 [14,17)
16 2021-01-01 16:00:00 [14,17)
17 2021-01-01 17:00:00 [17,22)
18 2021-01-01 18:00:00 [17,22)
19 2021-01-01 19:00:00 [17,22)
20 2021-01-01 20:00:00 [17,22)
21 2021-01-01 21:00:00 [17,22)
22 2021-01-01 22:00:00 [22,6)
23 2021-01-01 23:00:00 [22,6)
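The same shift can be written directly on the hour values with modular arithmetic, which makes the bin edges easier to verify. This is only a sketch of the idea above; it works on a plain Series of hours 0 to 23 instead of a datetime column, and the variable names are illustrative.

```python
import pandas as pd

hours = pd.Series(range(24), name='hour')

# Shift so that 06:00 maps to 0 and 05:00 maps to 23
shifted = (hours - 6) % 24

bins = pd.cut(shifted,
              bins=[0, 5, 8, 11, 16, 24],
              labels=['[6,11)', '[11,14)', '[14,17)', '[17,22)', '[22,6)'],
              right=False)

# Label for each original hour, in order 0..23
labels_by_hour = bins.astype(str).tolist()
print(dict(zip(range(24), labels_by_hour)))
```

Each bin edge is simply the original boundary minus 6 (for example 11 becomes 5 and 22 becomes 16), with 24 closing the wrapped bin.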
Yesterday I asked this question (with some good answers) which is very similar, but slightly different from the problem I'm presented with now. Say I have the following pd.DataFrame (dict):
eff_timestamp val id begin_timestamp end_timestamp
0 2021-01-01 00:00:00 -0.710230 1 2021-01-01 02:00:00 2021-01-01 05:30:00
1 2021-01-01 01:00:00 0.121464 1 2021-01-01 02:00:00 2021-01-01 05:30:00
2 2021-01-01 02:00:00 -0.156328 1 2021-01-01 02:00:00 2021-01-01 05:30:00
3 2021-01-01 03:00:00 0.788685 1 2021-01-01 02:00:00 2021-01-01 05:30:00
4 2021-01-01 04:00:00 0.505210 1 2021-01-01 02:00:00 2021-01-01 05:30:00
5 2021-01-01 05:00:00 -0.738344 1 2021-01-01 02:00:00 2021-01-01 05:30:00
6 2021-01-01 06:00:00 0.266910 1 2021-01-01 02:00:00 2021-01-01 05:30:00
7 2021-01-01 07:00:00 -0.587401 1 2021-01-01 02:00:00 2021-01-01 05:30:00
8 2021-01-02 00:00:00 -0.160692 2 2021-01-02 12:00:00 2021-01-02 15:30:00
9 2021-01-02 01:00:00 0.306354 2 2021-01-02 12:00:00 2021-01-02 15:30:00
10 2021-01-02 02:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
11 2021-01-02 03:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
12 2021-01-02 04:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
13 2021-01-02 05:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
14 2021-01-02 06:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
15 2021-01-02 07:00:00 -0.349705 2 2021-01-02 12:00:00 2021-01-02 15:30:00
I would like to get the mean value of val for each unique id, using only the values that lie between the begin_timestamp and end_timestamp. If no rows satisfy those criteria, I'd like to get the last value for that id before that period. Note that in this example, id=2 has no rows that satisfy the criteria. Previously I could slice the data so that I only keep the rows between the begin_timestamp and end_timestamp, and then use a groupby. The solution from my previous post then replaces the NaN value in the groupby object. However, in the example above, id=2 has no rows at all that satisfy the criteria, so no NaN value is created that could be replaced. So if I slice the data based on the criteria above:
sliced = df[(df.eff_timestamp > df.begin_timestamp) & (df.eff_timestamp < df.end_timestamp)]
sliced
>>>
eff_timestamp val id begin_timestamp end_timestamp
3 2021-01-01 03:00:00 0.788685 1 2021-01-01 02:00:00 2021-01-01 05:30:00
4 2021-01-01 04:00:00 0.505210 1 2021-01-01 02:00:00 2021-01-01 05:30:00
5 2021-01-01 05:00:00 -0.738344 1 2021-01-01 02:00:00 2021-01-01 05:30:00
sliced.groupby('id').val.mean()
>>>
id
1 0.185184
Name: val, dtype: float64
This result only includes id=1 with the mean value, but there is no value for id=2. How would I, instead of the mean, include the last available value for id=2, which is -0.349705?
Create a temporary column between_time, then group by the id column. In apply, check whether any row for that id lies within the range: if so, take the mean of those rows; otherwise take the value at last_valid_index.
result = (
df.assign(
between_time=(df.eff_timestamp > df.begin_timestamp) & (df.eff_timestamp < df.end_timestamp))
.groupby('id')
.apply(
lambda x: x.loc[x['between_time']]['val'].mean()
if any(x['between_time'].values)
else
x.loc[x['val'].last_valid_index()]['val']
)
)
OUTPUT:
id
1 0.185184
2 -0.349705
dtype: float64
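The same result can also be reached without apply, by computing the in-window means and the per-id fallback separately and then combining them. This is a sketch under the same reading as the answer above (the fallback is the last non-NaN val of the id); the frame is rebuilt from the question's table, and the names in_window, window_mean and last_val are illustrative.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    'eff_timestamp': pd.date_range('2021-01-01', periods=8, freq='60min')
                       .append(pd.date_range('2021-01-02', periods=8, freq='60min')),
    'val': [-0.710230, 0.121464, -0.156328, 0.788685,
            0.505210, -0.738344, 0.266910, -0.587401,
            -0.160692, 0.306354, nan, nan, nan, nan, nan, -0.349705],
    'id': [1] * 8 + [2] * 8,
})
df['begin_timestamp'] = df['id'].map({1: pd.Timestamp('2021-01-01 02:00'),
                                      2: pd.Timestamp('2021-01-02 12:00')})
df['end_timestamp'] = df['id'].map({1: pd.Timestamp('2021-01-01 05:30'),
                                    2: pd.Timestamp('2021-01-02 15:30')})

in_window = ((df.eff_timestamp > df.begin_timestamp)
             & (df.eff_timestamp < df.end_timestamp))

# Mean of the in-window rows; ids without such rows are missing here
window_mean = df[in_window].groupby('id')['val'].mean()

# Fallback: last non-NaN val per id
last_val = df.dropna(subset=['val']).groupby('id')['val'].last()

# Use the mean where it exists, otherwise the fallback
result = window_mean.reindex(last_val.index).fillna(last_val)
print(result)
```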
I am looking for the following functionality in python:
I have a Pandas DataFrame with 4 columns: ID, StartDate, EndDate, Moment.
I want to group by ID and evaluate, for every row in a group, whether the moment value falls inside the interval between the start and end dates of any row in that group. For example, the following DataFrame has two groups (ID=1 and ID=2), each containing 5 rows. For each row I want a boolean indicating whether the moment value in that row falls in ANY of the time windows in its group, a window being [date1, date2].
import pandas as pd
i = pd.date_range('2018-04-11', periods=10, freq='2D20min')
i2 = pd.date_range('2018-04-12', periods=10, freq='2D20min')
i3 = pd.date_range('2018-04-9', periods=10, freq='1D6H')
id = ['1', '1', '1', '1', '1', '2', '2', '2', '2', '2']
ts = pd.DataFrame({'date1': i, 'date2': i2, 'moment': i3}, index=id)
ID date1 date2 moment
1 2018-04-11 00:00:00 2018-04-12 00:00:00 2018-04-09 00:00:00
1 2018-04-13 00:20:00 2018-04-14 00:20:00 2018-04-10 06:00:00
1 2018-04-15 00:40:00 2018-04-16 00:40:00 2018-04-11 12:00:00
1 2018-04-17 01:00:00 2018-04-18 01:00:00 2018-04-12 18:00:00
1 2018-04-19 01:20:00 2018-04-20 01:20:00 2018-04-14 00:00:00
2 2018-04-21 01:40:00 2018-04-22 01:40:00 2018-04-15 06:00:00
2 2018-04-23 02:00:00 2018-04-24 02:00:00 2018-04-16 12:00:00
2 2018-04-25 02:20:00 2018-04-26 02:20:00 2018-04-17 18:00:00
2 2018-04-27 02:40:00 2018-04-28 02:40:00 2018-04-19 00:00:00
2 2018-04-29 03:00:00 2018-04-30 03:00:00 2018-04-20 06:00:00
In this case, the value for moment in the first row of the first group does not fall in any of the five time intervals. Neither does the second. The third value, 2018-04-11 12:00:00 does fall in the interval in the first row and I would thus want to have True returned.
The desired result would look as follows:
ID date1 date2 moment result
1 2018-04-11 00:00:00 2018-04-12 00:00:00 2018-04-09 00:00:00 False
1 2018-04-13 00:20:00 2018-04-14 00:20:00 2018-04-10 06:00:00 False
1 2018-04-15 00:40:00 2018-04-16 00:40:00 2018-04-11 12:00:00 True
1 2018-04-17 01:00:00 2018-04-18 01:00:00 2018-04-12 18:00:00 False
1 2018-04-19 01:20:00 2018-04-20 01:20:00 2018-04-14 00:00:00 True
2 2018-04-21 01:40:00 2018-04-22 01:40:00 2018-04-15 06:00:00 False
2 2018-04-23 02:00:00 2018-04-24 02:00:00 2018-04-16 12:00:00 False
2 2018-04-25 02:20:00 2018-04-26 02:20:00 2018-04-17 18:00:00 False
2 2018-04-27 02:40:00 2018-04-28 02:40:00 2018-04-19 00:00:00 False
2 2018-04-29 03:00:00 2018-04-30 03:00:00 2018-04-20 06:00:00 False
EDIT
I already 'solved' this problem with the following approach but am looking for a more pythonic and perhaps faster way...
boolean_result = []
for c in ts.index.unique():
    temp = ts.loc[ts.index == c]
    # iterate over the values: the index labels repeat within a group
    for current_date in temp['moment']:
        boolean_result.append(max((temp['date1'] <= current_date)
                                  & (current_date <= temp['date2'])))
ts['Result'] = boolean_result
This may be very slow if your dataframe is large, and there may well be a better solution than this one:
def time_in_range(start, end, x):
    """Return True if x is in the range [start, end]"""
    return start <= x <= end

# one boolean per row of ts
result = []
test_list = []
for i in ts.index.unique():
    temp_df = ts[ts.index == i]
    for j in range(0, len(temp_df)):
        # test moment j against every window in the same group
        for k in range(0, len(temp_df)):
            test_list.append(time_in_range(temp_df.date1.iloc[k], temp_df.date2.iloc[k], temp_df.moment.iloc[j]))
        result.append(any(test_list))
        # reset the list
        test_list = []
ts['result'] = result
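A possibly faster variant replaces the two inner loops with NumPy broadcasting, comparing every moment in a group against every window in one step. This is a sketch rather than a benchmarked solution: it keeps ID as a regular column instead of the index, and it builds the ranges with pd.Timedelta objects in place of the '2D20min' and '1D6H' frequency strings.

```python
import pandas as pd

# Rebuild the question's data with ID as a column
n = 10
step = pd.Timedelta(days=2, minutes=20)
ts = pd.DataFrame({
    'ID': ['1'] * 5 + ['2'] * 5,
    'date1': pd.date_range('2018-04-11', periods=n, freq=step),
    'date2': pd.date_range('2018-04-12', periods=n, freq=step),
    'moment': pd.date_range('2018-04-09', periods=n, freq=pd.Timedelta(hours=30)),
})

result = pd.Series(False, index=ts.index)
for _, group in ts.groupby('ID'):
    moments = group['moment'].to_numpy()[:, None]   # shape (rows, 1)
    starts = group['date1'].to_numpy()[None, :]     # shape (1, windows)
    ends = group['date2'].to_numpy()[None, :]
    # inside[r, w] is True when moment r lies in window w of the same group
    inside = (starts <= moments) & (moments <= ends)
    result.loc[group.index] = inside.any(axis=1)

ts['result'] = result
print(ts)
```

The comparison matrix has one cell per (moment, window) pair, so memory grows with the square of the group size; for very large groups a sorted-interval approach may be preferable.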
I have raw data like this and want to find the difference between the two times in minutes. The data is in a data frame.
source:
start time end time
0 08:30:00 17:30:00
1 11:00:00 17:30:00
2 08:00:00 21:30:00
3 19:30:00 22:00:00
4 19:00:00 00:00:00
5 08:30:00 15:30:00
I need an output like this:
duration
540mint
798mint
162mint
1140mint
420mint
Your expected output seems to be incorrect. That aside, we can use base R's difftime:
transform(
df,
duration = difftime(
strptime(end.time, format = "%H:%M:%S"),
strptime(start.time, format = "%H:%M:%S"),
units = "mins"))
# start.time end.time duration
#0 08:30:00 17:30:00 540 mins
#1 11:00:00 17:30:00 390 mins
#2 08:00:00 21:30:00 810 mins
#3 19:30:00 22:00:00 150 mins
#4 19:00:00 00:00:00 -1140 mins
#5 08:30:00 15:30:00 420 mins
or as a difftime vector
with(df, difftime(
strptime(end.time, format = "%H:%M:%S"),
strptime(start.time, format = "%H:%M:%S"),
units = "mins"))
#Time differences in mins
#[1] 540 390 810 150 -1140 420
Sample data
df <- read.table(text =
" 'start time' 'end time'
0 08:30:00 17:30:00
1 11:00:00 17:30:00
2 08:00:00 21:30:00
3 19:30:00 22:00:00
4 19:00:00 00:00:00
5 08:30:00 15:30:00", header = T, row.names = 1)
import pandas as pd
df = pd.DataFrame({'start time':['08:30:00','11:00:00','08:00:00','19:30:00','19:00:00','08:30:00'],'end time':['17:30:00','17:30:00','21:30:00','22:00:00','00:00:00','15:30:00']},columns=['start time','end time'])
df
Out[355]:
start time end time
0 08:30:00 17:30:00
1 11:00:00 17:30:00
2 08:00:00 21:30:00
3 19:30:00 22:00:00
4 19:00:00 00:00:00
5 08:30:00 15:30:00
(pd.to_datetime(df['end time']) - pd.to_datetime(df['start time'])).dt.seconds/60
Out[356]:
0 540.0
1 390.0
2 810.0
3 150.0
4 300.0
5 420.0
dtype: float64
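Row 4 (19:00:00 to 00:00:00) comes out as 300 minutes above because .dt.seconds ignores the negative day component of the timedelta. The sketch below makes that wrap past midnight explicit. It assumes an end time earlier than the start time always means the next day, and it parses the strings with pd.to_timedelta so no calendar date is involved.

```python
import pandas as pd

df = pd.DataFrame({
    'start time': ['08:30:00', '11:00:00', '08:00:00', '19:30:00', '19:00:00', '08:30:00'],
    'end time':   ['17:30:00', '17:30:00', '21:30:00', '22:00:00', '00:00:00', '15:30:00'],
})

# 'HH:MM:SS' strings parse directly as offsets from midnight
dur = pd.to_timedelta(df['end time']) - pd.to_timedelta(df['start time'])

# Assumption: a negative duration means the interval crossed midnight
dur = dur.where(dur >= pd.Timedelta(0), dur + pd.Timedelta(days=1))

df['duration'] = dur.dt.total_seconds() / 60
print(df)
```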
Yes, definitely datetime is what you need here. Specifically, the strptime function, which parses a string into a datetime object.
from datetime import datetime
s1 = '10:33:26'
s2 = '11:15:49' # for example
FMT = '%H:%M:%S'
tdelta = datetime.strptime(s2, FMT) - datetime.strptime(s1, FMT)
That gets you a timedelta object that contains the difference between the two times. You can do whatever you want with that, e.g. converting it to seconds or adding it to another datetime.
This will return a negative result if the end time is earlier than the start time, for example s1 = 12:00:00 and s2 = 05:00:00. If you want the code to assume the interval crosses midnight in this case (i.e. it should assume the end time is never earlier than the start time), you can add the following lines to the above code:
if tdelta.days < 0:
    tdelta = timedelta(days=0,
                       seconds=tdelta.seconds, microseconds=tdelta.microseconds)
(of course you need to include from datetime import timedelta somewhere). Thanks to J.F. Sebastian for pointing out this use case.
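Putting the pieces together, a self-contained version might look like the sketch below. The helper name time_diff is hypothetical, and the midnight rule is the assumption described above (an earlier end time means the interval crossed midnight).

```python
from datetime import datetime, timedelta

FMT = '%H:%M:%S'

def time_diff(s1, s2):
    """Difference s2 - s1 for 'HH:MM:SS' strings, wrapping past midnight."""
    tdelta = datetime.strptime(s2, FMT) - datetime.strptime(s1, FMT)
    if tdelta.days < 0:
        # Drop the negative day; the seconds part already holds the wrapped value
        tdelta = timedelta(days=0,
                           seconds=tdelta.seconds,
                           microseconds=tdelta.microseconds)
    return tdelta

same_day = time_diff('10:33:26', '11:15:49')
overnight = time_diff('12:00:00', '05:00:00')
print(same_day, overnight)
```

Calling total_seconds() on the result and dividing by 60 gives the duration in minutes.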