Round up half of the hour in pandas - python

The round() function in pandas rounds the time 07:30 down to 07:00, but I want to round up any time at or past the 30-minute mark (inclusive).
E.g.
07:15 to 07:00
05:25 to 05:00
22:30 to 23:00
18:45 to 19:00
How to achieve this for a column of a dataframe using pandas?

timestamps
You need to use dt.round. This is however a bit tricky, because whether xx:30 goes to the previous or the next hour depends on the hour itself: ties are rounded to the even hour. You can force the direction by adding or subtracting a small amount of time (here 1ns):
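import pandas as pd
# unambiguous ISO strings; mixed day-first/month-first strings may fail to parse in recent pandas
s = pd.to_datetime(pd.Series(['2021-01-02 03:45', '2021-04-25 12:30',
                              '2021-04-25 13:30', '2022-12-04 23:45']))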
# xx:30 -> rounding depending on the hour parity (default)
s.dt.round(freq='1h')
0 2021-01-02 04:00:00
1 2021-04-25 12:00:00 <- -30min
2 2021-04-25 14:00:00 <- +30min
3 2022-12-05 00:00:00
dtype: datetime64[ns]
# xx:30 -> xx:00 (force down)
s.sub(pd.Timedelta('1ns')).dt.round(freq='1h')
0 2021-01-02 04:00:00
1 2021-04-25 12:00:00
2 2021-04-25 13:00:00
3 2022-12-05 00:00:00
dtype: datetime64[ns]
# xx:30 -> next hour (force up)
s.add(pd.Timedelta('1ns')).dt.round(freq='1h')
0 2021-01-02 04:00:00
1 2021-04-25 13:00:00
2 2021-04-25 14:00:00
3 2022-12-05 00:00:00
dtype: datetime64[ns]
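If shifting by a nanosecond feels fragile, another common way to get round-half-up is to add half the interval and then floor. This is only a minimal sketch, assuming hourly rounding; the helper name round_half_up_hour is made up for illustration and is not part of pandas:

```python
import pandas as pd

def round_half_up_hour(s: pd.Series) -> pd.Series:
    """Round to the hour, sending xx:30 and later to the next hour."""
    # Adding 30 minutes pushes every time at or past the half hour over
    # the hour boundary; flooring then drops the remainder.
    return (s + pd.Timedelta(minutes=30)).dt.floor("1h")

times = pd.Series(pd.to_datetime([
    "2021-04-25 07:15", "2021-04-25 05:25",
    "2021-04-25 22:30", "2021-04-25 18:45",
]))
rounded = round_half_up_hour(times)
print(rounded.dt.strftime("%H:%M").tolist())
```

With the four times from the question this should give 07:00, 05:00, 23:00 and 19:00.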
floats
IIUC, you can use divmod (or numpy.modf) to get the integer and decimal part, then perform simple boolean arithmetic:
s = pd.Series([7.15, 5.25, 22.30, 18.45])
s2, r = s.divmod(1) # or np.modf(s)
s2[r.ge(0.3)] += 1
s2 = s2.astype(int)
Alternative: using mod and boolean to int equivalence:
s2 = s.astype(int)+s.mod(1).ge(0.3)
output:
0 7
1 5
2 23
3 19
dtype: int64
Note on precision: comparing floats is not always reliable because of floating-point arithmetic. For instance, using gt would give the wrong answer for 22.30 here, since 22.30 % 1 evaluates to slightly more than 0.3. To be safe, round to 2 digits first:
s.mod(1).round(2).ge(0.3)
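or use integers (round before casting, since astype(int) truncates and a value such as 29.999... would become 29):
s.mod(1).mul(100).round().astype(int).ge(30)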
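The integer idea can be pushed one step further by converting the whole HH.MM float to an integer before splitting it, so no comparison is ever done on a float. A rough sketch, assuming the floats really encode hours and minutes as HH.MM; round_hhmm_float is a hypothetical helper name:

```python
import pandas as pd

def round_hhmm_float(s: pd.Series) -> pd.Series:
    """Round HH.MM floats to the hour, with MM >= 30 going up."""
    # 22.30 -> 2230; round() removes floating-point noise before the cast
    total = s.mul(100).round().astype(int)
    hours = total // 100
    minutes = total % 100
    # a boolean mask cast to int adds 1 exactly where minutes >= 30
    return hours + (minutes >= 30).astype(int)

s = pd.Series([7.15, 5.25, 22.30, 18.45, 5.30])
result = round_hhmm_float(s)
print(result.tolist())
```

The extra value 5.30 is included because 5.30 % 1 is slightly below 0.3 in binary floating point, which is the case a plain ge(0.3) gets wrong.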

Here is a version that works with timestamps:
import numpy as np
import pandas as pd
# dummy data:
df = pd.DataFrame({'time': pd.to_datetime([np.random.randint(0, 10**8) for a in range(10)], unit='s')})
def custom_round(row, col, out):
    # row is a single row (a Series) because apply is called with axis=1
    if row[col].minute >= 30:
        row[out] = row[col].ceil('h')
    else:
        row[out] = row[col].floor('h')
    return row
df = df.apply(lambda x: custom_round(x, 'time', 'new_time'), axis=1)
#edit:
using numpy:
def custom_round(df, col, out):
    # vectorized: pick ceil where the minute is >= 30, floor otherwise
    df[out] = np.where(
        df[col].dt.minute >= 30,
        df[col].dt.ceil('h'),
        df[col].dt.floor('h')
    )
    return df
df = custom_round(df, 'time', 'new_time')

Related

First week of year considering the first day last year

I have the following df:
time_series date sales
store_0090_item_85261507 1/2020 1,0
store_0090_item_85261501 2/2020 0,0
store_0090_item_85261500 3/2020 6,0
Here 'date' is Week/Year.
So I tried the following code:
df['date'] = df['date'].apply(lambda x: datetime.strptime(x + '/0', "%U/%Y/%w"))
But it returns this df:
time_series date sales
store_0090_item_85261507 2020-01-05 1,0
store_0090_item_85261501 2020-01-12 0,0
store_0090_item_85261500 2020-01-19 6,0
But the first day of the first week of 2020 is 2019-12-29, considering Sunday as the first day. How can I get 2019-12-29 as the first day of the first week of 2020, and not 2020-01-05?
From the datetime module's documentation:
%U: Week number of the year (Sunday as the first day of the week) as a zero padded decimal number. All days in a new year preceding the first Sunday are considered to be in week 0.
Edit: My original answer doesn't work for input 1/2023, and using ISO 8601 date values doesn't work for 1/2021, so I've edited this answer by adding a custom function.
Here is a way with a custom function
import pandas as pd
from datetime import datetime, timedelta
##############################################
# to demonstrate issues with certain dates
print(datetime.strptime('0/2020/0', "%U/%Y/%w")) # 2019-12-29 00:00:00
print(datetime.strptime('1/2020/0', "%U/%Y/%w")) # 2020-01-05 00:00:00
print(datetime.strptime('0/2021/0', "%U/%Y/%w")) # 2020-12-27 00:00:00
print(datetime.strptime('1/2021/0', "%U/%Y/%w")) # 2021-01-03 00:00:00
print(datetime.strptime('0/2023/0', "%U/%Y/%w")) # 2023-01-01 00:00:00
print(datetime.strptime('1/2023/0', "%U/%Y/%w")) # 2023-01-01 00:00:00
#################################################
df = pd.DataFrame({'date':["1/2020", "2/2020", "3/2020", "1/2021", "2/2021", "1/2023", "2/2023"]})
print(df)
def get_first_day(date):
    year = date.split('/')[1]
    # Sundays of %U week 0 and week 1 of that year
    date0 = datetime.strptime('0/' + year + '/0', "%U/%Y/%w")
    date1 = datetime.strptime('1/' + year + '/0', "%U/%Y/%w")
    date = datetime.strptime(date + '/0', "%U/%Y/%w")
    # if the year starts on a Sunday there is no week 0, so no shift is needed
    return date if date0 == date1 else date - timedelta(weeks=1)
df['new_date'] = df['date'].apply(get_first_day)
print(df)
Input
date
0 1/2020
1 2/2020
2 3/2020
3 1/2021
4 2/2021
5 1/2023
6 2/2023
Output
date new_date
0 1/2020 2019-12-29
1 2/2020 2020-01-05
2 3/2020 2020-01-12
3 1/2021 2020-12-27
4 2/2021 2021-01-03
5 1/2023 2023-01-01
6 2/2023 2023-01-08
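The same result can be computed arithmetically instead of parsing each string three times. The sketch below assumes, as the output above suggests, that week 1 is the Sunday-started week containing 1 January; the function name week_start_sunday is made up for illustration:

```python
import pandas as pd

def week_start_sunday(labels: pd.Series) -> pd.Series:
    """Map 'week/year' labels to the Sunday that starts that week."""
    parts = labels.str.split("/", expand=True).astype(int)
    week, year = parts[0], parts[1]
    jan1 = pd.to_datetime(year.astype(str) + "-01-01")
    # dayofweek is Monday=0 ... Sunday=6, so this is the number of days since Sunday
    days_since_sunday = (jan1.dt.dayofweek + 1) % 7
    week1_start = jan1 - pd.to_timedelta(days_since_sunday, unit="D")
    return week1_start + pd.to_timedelta((week - 1) * 7, unit="D")

labels = pd.Series(["1/2020", "2/2020", "3/2020", "1/2021", "2/2021", "1/2023", "2/2023"])
starts = week_start_sunday(labels)
print(starts.dt.strftime("%Y-%m-%d").tolist())
```

This avoids the special case for years that begin on a Sunday, because the offset back to the previous Sunday is simply zero in that case.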
You'll want to use the ISO week parsing directives, e.g.:
import pandas as pd
date = pd.Series(["1/2020", "2/2020", "3/2020"])
pd.to_datetime(date+"/1", format="%V/%G/%u")
0 2019-12-30
1 2020-01-06
2 2020-01-13
dtype: datetime64[ns]
You can also shift by one day if the week should start on Sunday:
pd.to_datetime(date+"/1", format="%V/%G/%u") - pd.Timedelta('1d')
0 2019-12-29
1 2020-01-05
2 2020-01-12
dtype: datetime64[ns]

Cumulative elapsed minutes from Pandas datetime Series

I have a column of datetime stamps. I need a column of total minutes elapsed from the first to the last value.
I have:
>>> df = pd.DataFrame({'timestamp': [
... pd.Timestamp('2001-01-01 06:00:00'),
... pd.Timestamp('2001-01-01 06:01:00'),
... pd.Timestamp('2001-01-01 06:15:00')
... ]})
>>> df
timestamp
0 2001-01-01 06:00:00
1 2001-01-01 06:01:00
2 2001-01-01 06:15:00
I need to add a column that gives the running total:
timestamp minutes
1-1-2001 6:00 0
1-1-2001 6:01 1
1-1-2001 6:15 15
1-1-2001 7:00 60
1-1-2001 7:35 95
Having a hard time manipulating the datetime Series to allow me to total up the timestamp.
I've looked at a lot of posts and can't find anything that does what I'm trying to do. Would appreciate any ideas!
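You can chain a few methods together (note that the NaT is filled with pd.Timedelta(0) rather than a plain 0, so the column keeps its timedelta dtype):
>>> df['minutes'] = df['timestamp'].diff().fillna(pd.Timedelta(0)).dt.total_seconds()\
...     .cumsum().div(60).astype(int)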
>>> df
timestamp minutes
0 2001-01-01 06:00:00 0
1 2001-01-01 06:01:00 1
2 2001-01-01 06:15:00 15
Creation:
>>> df = pd.DataFrame({'timestamp': [
... pd.Timestamp('2001-01-01 06:00:00'),
... pd.Timestamp('2001-01-01 06:01:00'),
... pd.Timestamp('2001-01-01 06:15:00')
... ]})
Walkthrough
The easiest way to break this down is to separate each intermediate method call.
df['timestamp'].diff() gives you a Series of timedeltas (the pandas equivalent of Python's datetime.timedelta): the difference between each value and the one before it.
>>> df['timestamp'].diff()
0 NaT
1 00:01:00
2 00:14:00
Name: timestamp, dtype: timedelta64[ns]
This contains an N/A value (NaT, "not a time") because there's nothing to subtract from the first value. You can simply fill it with the zero value for timedeltas:
>>> df['timestamp'].diff().fillna(pd.Timedelta(0))
0 00:00:00
1 00:01:00
2 00:14:00
Name: timestamp, dtype: timedelta64[ns]
Now you need to get an actual number from these objects. In .dt.total_seconds(), .dt is an "accessor" that gives you access to a set of methods for working with datetime-like data:
>>> df['timestamp'].diff().fillna(pd.Timedelta(0)).dt.total_seconds()
0 0.0
1 60.0
2 840.0
Name: timestamp, dtype: float64
The result is the incremental change in seconds as a float. You need this on a cumulative basis, in minutes, and as an integer. That's what the final 3 operations do:
>>> df['timestamp'].diff().fillna(pd.Timedelta(0)).dt.total_seconds().cumsum().div(60).astype(int)
0 0
1 1
2 15
Name: timestamp, dtype: int64
Note that astype(int) truncates toward zero rather than rounding, so any partial minute is dropped if the seconds aren't evenly divisible by 60.
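Since the cumulative sum of consecutive differences is just the distance from the first value, the same column can also be obtained by subtracting the first timestamp directly. A minimal sketch, assuming the timestamps are sorted and that partial minutes should be dropped; the two extra rows come from the desired output in the question:

```python
import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime([
    "2001-01-01 06:00:00",
    "2001-01-01 06:01:00",
    "2001-01-01 06:15:00",
    "2001-01-01 07:00:00",
    "2001-01-01 07:35:00",
])})

# elapsed time since the first row, as a timedelta Series
elapsed = df["timestamp"] - df["timestamp"].iloc[0]
# whole minutes; floor division drops any partial minute
df["minutes"] = (elapsed.dt.total_seconds() // 60).astype(int)
print(df)
```

This version has no NaT to fill, because the first row is subtracted from itself.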

pandas: selecting rows in a specific time window

I have a dataset of samples covering multiple days, all with a timestamp.
I want to select rows within a specific time window. E.g. all rows that were generated between 1pm and 3 pm every day.
This is a sample of my data in a pandas dataframe:
22 22 2018-04-12T20:14:23Z 2018-04-12T21:14:23Z 0 6370.1
23 23 2018-04-12T21:14:23Z 2018-04-12T21:14:23Z 0 6368.8
24 24 2018-04-12T22:14:22Z 2018-04-13T01:14:23Z 0 6367.4
25 25 2018-04-12T23:14:22Z 2018-04-13T01:14:23Z 0 6365.8
26 26 2018-04-13T00:14:22Z 2018-04-13T01:14:23Z 0 6364.4
27 27 2018-04-13T01:14:22Z 2018-04-13T01:14:23Z 0 6362.7
28 28 2018-04-13T02:14:22Z 2018-04-13T05:14:22Z 0 6361.0
29 29 2018-04-13T03:14:22Z 2018-04-13T05:14:22Z 0 6359.3
.. ... ... ... ... ...
562 562 2018-05-05T08:13:21Z 2018-05-05T09:13:21Z 0 6300.9
563 563 2018-05-05T09:13:21Z 2018-05-05T09:13:21Z 0 6300.7
564 564 2018-05-05T10:13:14Z 2018-05-05T13:13:14Z 0 6300.2
565 565 2018-05-05T11:13:14Z 2018-05-05T13:13:14Z 0 6299.9
566 566 2018-05-05T12:13:14Z 2018-05-05T13:13:14Z 0 6299.6
How do I achieve that? I need to ignore the date and just evaluate the time component. I could traverse the dataframe in a loop and evaluate the datetime that way, but there must be a simpler way to do it.
I converted messageDate, which was read as a string, to a datetime with
df["messageDate"]=pd.to_datetime(df["messageDate"])
But after that I got stuck on how to filter on time only.
Any input appreciated.
Datetime columns expose a .dt accessor (a DatetimeProperties object), from which you can extract components such as the hour and filter on them:
import pandas as pd
df = pd.DataFrame(
    [
        '2018-04-12T12:00:00Z', '2018-04-12T14:00:00Z', '2018-04-12T20:00:00Z',
        '2018-04-13T12:00:00Z', '2018-04-13T14:00:00Z', '2018-04-13T20:00:00Z'
    ],
    columns=['messageDate']
)
# parse the strings first; the trailing Z makes the column timezone-aware (UTC)
df["messageDate"] = pd.to_datetime(df["messageDate"])
df
# messageDate
# 0 2018-04-12 12:00:00+00:00
# 1 2018-04-12 14:00:00+00:00
# 2 2018-04-12 20:00:00+00:00
# 3 2018-04-13 12:00:00+00:00
# 4 2018-04-13 14:00:00+00:00
# 5 2018-04-13 20:00:00+00:00
# keep hours 13 and 14, i.e. 13:00:00 up to (but not including) 15:00:00
time_mask = (df['messageDate'].dt.hour >= 13) & \
            (df['messageDate'].dt.hour < 15)
df[time_mask]
# messageDate
# 1 2018-04-12 14:00:00+00:00
# 4 2018-04-13 14:00:00+00:00
I hope the code is self explanatory. You can always ask questions.
import pandas as pd
# Prepping data for example
dates = pd.date_range('1/1/2018', periods=7, freq='h')
data = {'A' : range(7)}
df = pd.DataFrame(index = dates, data = data)
print(df)
# A
# 2018-01-01 00:00:00 0
# 2018-01-01 01:00:00 1
# 2018-01-01 02:00:00 2
# 2018-01-01 03:00:00 3
# 2018-01-01 04:00:00 4
# 2018-01-01 05:00:00 5
# 2018-01-01 06:00:00 6
# Creating a mask to filter the values we wish to keep.
# Here, we use df.index because the index is our datetime.
# If the datetime is a column, you can use df['column_name'] instead.
mask = (df.index > '2018-1-1 01:00:00') & (df.index < '2018-1-1 05:00:00')
print(mask)
# [False False True True True False False]
df_with_good_dates = df.loc[mask]
print(df_with_good_dates)
# A
# 2018-01-01 02:00:00 2
# 2018-01-01 03:00:00 3
# 2018-01-01 04:00:00 4
df = df[(df["messageDate"].apply(lambda x: x.hour) >= 13) & (df["messageDate"].apply(lambda x: x.hour) < 15)]
You can use x.minute and x.second similarly.
Try this after ensuring messageDate is indeed in datetime format, as you have done:
df.set_index('messageDate', inplace=True)
choseInd = [ind for ind in df.index if 13 <= ind.hour < 15]
df_select = df.loc[choseInd]
You can do the same without making the datetime column the index, as the answer using apply with a lambda shows.
It just makes your dataframe 'better looking' if the datetime is your index rather than a numerical one.
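Once the datetime is the index, pandas also has a built-in for exactly this kind of daily window, DataFrame.between_time, which ignores the date part. A small sketch with made-up data (the value column is only there for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "messageDate": pd.to_datetime([
        "2018-04-12 12:00:00", "2018-04-12 14:00:00", "2018-04-12 20:00:00",
        "2018-04-13 12:00:00", "2018-04-13 14:00:00", "2018-04-13 20:00:00",
    ]),
    "value": range(6),
})

# between_time needs a DatetimeIndex and compares only the time of day;
# by default both end points (13:00 and 15:00) are included
afternoon = df.set_index("messageDate").between_time("13:00", "15:00")
print(afternoon)
```

Unlike the hour-based masks, this works with arbitrary boundaries such as "13:30" without any extra arithmetic.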

How to resample yearly starting from 1st of June to 31st may?

How do I resample a dataframe with a daily time-series index to yearly, but not from 1st Jan to 31st Dec? Instead I want the yearly sum from 1 June to 31 May.
First I did this, which gives me the yearly sum from 1.Jan to 31.Dec:
df.resample(rule='A').sum()
I have tried using the base-parameter, but it does not change the resample sum.
df.resample(rule='A', base=100).sum()
Here is a part of my dataframe:
In []: df
Out[]:
Index ET P R
2010-01-01 00:00:00 -0.013 0.0 0.773
2010-01-02 00:00:00 0.0737 0.21 0.797
2010-01-03 00:00:00 -0.048 0.0 0.926
...
In []: df.resample(rule='A', base = 0, label='left').sum()
Out []:
Index
2009-12-31 00:00:00 424.131138 871.48 541.677405
2010-12-31 00:00:00 405.625780 939.06 575.163096
2011-12-31 00:00:00 461.586365 1064.82 710.507947
...
I would really appreciate it if anyone could help me figure out how to do this.
Thank you
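Use 'AS-JUN' (annual frequency, with each year starting on 1 June) as the rule with resample; recent pandas versions prefer the equivalent alias 'YS-JUN':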
# Example data
idx = pd.date_range('2017-01-01', '2018-12-31')
s = pd.Series(1, idx)
# Resample
s = s.resample('AS-JUN').sum()
The resulting output:
2016-06-01 151
2017-06-01 365
2018-06-01 214
Freq: AS-JUN, dtype: int64
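If you would rather not depend on a frequency alias, the same June-to-May totals can be obtained by labelling each day with the calendar year in which its June-to-May year began and grouping on that label. This is only a sketch; the name start_year is an assumption for illustration:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2017-01-01", "2018-12-31")
s = pd.Series(1, index=idx)

# Days from June onward belong to the year starting in their own calendar
# year; days from January to May belong to the year that started the June before.
start_year = np.where(idx.month >= 6, idx.year, idx.year - 1)
totals = s.groupby(start_year).sum()
print(totals)
```

The group labels 2016, 2017 and 2018 correspond to the bins starting on 2016-06-01, 2017-06-01 and 2018-06-01 in the resample output above.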

Rounding to nearest min in a pandas dataframe

I want to round the hh:mm:ss values in my TimeStamp col of the dataframe so that the seconds are always 00
Dataset:
TimeStamp A B C
10:27:30 1.953036 2.110234 1.981548
10:28:30 1.973408 2.046361 1.806923
10:29:30 0.000000 0.000000 0.014881
10:30:30 2.567976 3.169928 3.479591
I tried:
df_sr4500.TimeStamp = np.asarray(np.round(df_sr4500.TimeStamp.values / np.timedelta64(1, 'm')), dtype='timedelta64[m]')
I get:
TimeStamp A B C
10:28:00 1.953036 2.110234 1.981548
10:28:00 1.973408 2.046361 1.806923
10:30:00 0.000000 0.000000 0.014881
10:30:00 2.567976 3.169928 3.479591
I would rather have it always round the same way for the middle value (xx:xx:30).
The next step for me is to resample every two min with mean
Please suggest if you have any better/easier way to do this. I am very new to pandas/numpy.
Add one nanosecond to each time and then round.
df['TimeStamp'] = (df['TimeStamp'] + pd.Timedelta(1)).dt.round('min')
0 10:28:00
1 10:29:00
2 10:30:00
3 10:31:00
Name: TimeStamp, dtype: timedelta64[ns]
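To see why the one-nanosecond shift helps, it can be compared with plain rounding, which sends exact half-minute ties to the even minute. A minimal sketch using the four values from the question, assuming the column holds timedeltas as in the output above:

```python
import pandas as pd

ts = pd.Series(pd.to_timedelta(["10:27:30", "10:28:30", "10:29:30", "10:30:30"]))

# plain rounding: ties go to the even minute, so neighbours can collapse
half_even = ts.dt.round("min")
# pd.Timedelta(1) is one nanosecond, which breaks every tie upward
half_up = (ts + pd.Timedelta(1)).dt.round("min")

print(pd.DataFrame({"original": ts, "round": half_even, "shift_then_round": half_up}))
```

The plain version reproduces the duplicated 10:28 and 10:30 values from the question, while the shifted version yields four distinct minutes.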
