Change starting and ending hour of pandas timestamp - python

I am dealing with a dataset where observations occur between opening and closing hours -- but the service closes on the day after it opens. For example, opening occurs at 7am and closing at 1am, the following day.
This feels like a very common problem -- I've searched around for it and am open to the fact I might just not know the correct terms to search for.
For most of my uses it's enough to do something like:
open_close = pd.date_range(start='2012-01-01 05:00:00', periods=15, freq='D')
Then I can just do fun little groupbys on the df:
df.groupby(open_close.asof).agg(func).
But I've run into an instance where I need to grab multiple of these open-close periods. What I really want is a DatetimeIndex where I get to pick when a day starts, so I could just redefine 'day' to run from 5AM to 5AM. The nice thing about this is I can then use things like df[df.index.dayofweek == 6] and get back everything from 5AM on Sunday to 5AM on Monday.
It feels like Periods, or something else inside pandas, anticipated this request. I would love help figuring it out.
EDIT:
I've also worked this out by creating another column with the right day:
df['shift_day'] = df['datetime'].apply(magicFunctionToFigureOutOpenClose)
-- so this isn't blocking my progress. Just feels like something that could be nicely integrated into the package (or datetime...or somewhere...)
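For example, one way to implement that helper, assuming the 'day' should start at 5AM (a sketch, not necessarily what magicFunctionToFigureOutOpenClose does):
import pandas as pd
# shift each timestamp back 5 hours, then take the calendar date,
# so each 'day' runs from 5AM to 5AM
df['shift_day'] = (df['datetime'] - pd.Timedelta(hours=5)).dt.date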

Perhaps the base parameter of df.resample() would help:
base : int, default 0
For frequencies that evenly subdivide 1 day, the "origin" of the
aggregated intervals. For example, for '5min' frequency, base could
range from 0 through 4. Defaults to 0
Here's an example:
In [44]: df = pd.DataFrame(np.random.rand(28),
....: index=pd.DatetimeIndex(start='2012/9/1', periods=28, freq='H'))
In [45]: df
Out[45]:
0
2012-09-01 00:00:00 0.970273
2012-09-01 01:00:00 0.730171
2012-09-01 02:00:00 0.508588
2012-09-01 03:00:00 0.535351
2012-09-01 04:00:00 0.940255
2012-09-01 05:00:00 0.143483
2012-09-01 06:00:00 0.792659
2012-09-01 07:00:00 0.231413
2012-09-01 08:00:00 0.071676
2012-09-01 09:00:00 0.995202
2012-09-01 10:00:00 0.236551
2012-09-01 11:00:00 0.904853
2012-09-01 12:00:00 0.652873
2012-09-01 13:00:00 0.488400
2012-09-01 14:00:00 0.396647
2012-09-01 15:00:00 0.967261
2012-09-01 16:00:00 0.554188
2012-09-01 17:00:00 0.884086
2012-09-01 18:00:00 0.418577
2012-09-01 19:00:00 0.189584
2012-09-01 20:00:00 0.577041
2012-09-01 21:00:00 0.100332
2012-09-01 22:00:00 0.294672
2012-09-01 23:00:00 0.925425
2012-09-02 00:00:00 0.630807
2012-09-02 01:00:00 0.400261
2012-09-02 02:00:00 0.156469
2012-09-02 03:00:00 0.658608
In [46]: df.resample("24H", how=sum, label='left', closed='left', base=5)
Out[46]:
0
2012-08-31 05:00:00 3.684638
2012-09-01 05:00:00 11.671068
In [47]: df.ix[:5].sum()
Out[47]: 0 3.684638
In [48]: df.ix[5:].sum()
Out[48]: 0 11.671068
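Note that this transcript is from an older pandas: DatetimeIndex(start=...), the how= argument to resample(), base=, and .ix have all since been removed. A rough equivalent on current pandas, where base= was replaced by offset= (and origin=) in 1.1, might look like:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(28),
                  index=pd.date_range(start='2012-09-01', periods=28, freq='h'))
# offset= shifts the bin edges so each 24-hour bucket runs 5AM-5AM
out = df.resample('24h', label='left', closed='left', offset='5h').sum()
# the same sanity checks as above, with .iloc in place of .ix
print(df.iloc[:5].sum())
print(df.iloc[5:].sum())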


Pandas average values for the same hour for each day

The post Pandas Dataframe, getting average value for each monday 1 Am describes a similar problem to mine, so I borrowed their dataframe. However, mine is a little harder and that solution didn't quite work for me.
time value
2016-03-21 00:00:00 0.613014
2016-03-21 01:00:00 0.596383
2016-03-21 02:00:00 0.623570
2016-03-21 03:00:00 0.663350
2016-03-21 04:00:00 0.677817
2016-03-21 05:00:00 0.727116
2016-03-21 06:00:00 0.920279
2016-03-21 07:00:00 1.205863
2016-03-21 08:00:00 0.880946
2016-03-21 09:00:00 0.186947
2016-03-21 10:00:00 -0.563276
2016-03-21 11:00:00 -1.249595
2016-03-21 12:00:00 -1.596035
2016-03-21 13:00:00 -1.886954
2016-03-21 14:00:00 -1.912325
2016-03-21 15:00:00 -1.750623
...
2016-06-20 23:00:00 2.125791
What I need to do is get the average value across all days at each specific time. Say, I have to get the average at 1AM, then at 2AM, then at 3AM.
I would like to do this with a groupby, which might make the rolling average I have to do afterwards easier, but every method counts, thanks!
You can extract the hour into a separate column and groupby() on it:
df['hour'] = df.time.dt.hour
result_df = df.groupby(['hour']).mean()
You could create a temporary column that holds the hour and group on that, e.g.:
df['hour'] = df['time'].dt.hour
and then
hour_avg = df.groupby(['hour']).mean()
Note that this approach aggregates each hour over all years, months and days.
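By the way, the temporary column isn't strictly required; grouping on the hour directly should give the same result (a sketch, assuming time is a datetime64 column as in the sample above):
# group on the hour without a helper column
hour_avg = df.groupby(df['time'].dt.hour)['value'].mean()
# if the timestamps live in the index instead of a column:
# hour_avg = df.groupby(df.index.hour)['value'].mean()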

Change value in pandas series based on hour of the day using df.apply and if statement

I have a large df with a datetime index with an hourly time step and precipitation values in several columns. My precipitation values are a cumulative total during the day (from 1:00 am to 0:00 am of the next day) and are reset every day; for example:
datetime S1
2000-01-01 00:00:00 4.5 ...
2000-01-01 01:00:00 0 ...
2000-01-01 02:00:00 0 ...
2000-01-01 03:00:00 0 ...
2000-01-01 04:00:00 0
2000-01-01 05:00:00 0
2000-01-01 06:00:00 0
2000-01-01 07:00:00 0
2000-01-01 08:00:00 0
2000-01-01 09:00:00 0
2000-01-01 10:00:00 0
2000-01-01 11:00:00 6.5
2000-01-01 12:00:00 7.5
2000-01-01 13:00:00 8.7
2000-01-01 14:00:00 8.7
...
2000-01-01 22:00:00 8.7
2000-01-01 23:00:00 8.7
2000-01-02 00:00:00 8.7
2000-01-02 01:00:00 0
I am trying to go from this to the actual hourly values, so the value at 1:00 am each day is fine as-is, and for every other hour I want to subtract the value from the timestep before.
Can I somehow use if statement inside of df.apply?
I thought of something like:
df_copy = df.copy()
df = df.apply(lambda x: if df.hour !=1: era5_T[x]=era5_T[x]-era5_T_copy[x-1])
But this is not working, since I'm not actually calling a function? I could use a for loop, but that doesn't seem like the most efficient way, as I'm working with a big dataset.
You can use numpy.where and pd.Series.shift to achieve the result:
import numpy as np
df['hourly_S1'] = np.where(df.index.hour == 1, df.S1, df.S1 - df.S1.shift())
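A quick check on toy data, plus an equivalent pandas-only spelling with Series.diff and Series.where (the 01:00 reset comes from the question; the numbers are made up):
import pandas as pd
idx = pd.date_range('2000-01-01 01:00', periods=6, freq='h')
df = pd.DataFrame({'S1': [0, 0, 2.5, 2.5, 4.0, 4.0]}, index=idx)
# keep the raw value at the 01:00 reset, difference everywhere else
df['hourly_S1'] = df['S1'].diff().where(df.index.hour != 1, df['S1'])
print(df)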

How to convert hourly data to half hourly

I have the following dataframe:
datetime temp
0 2015-01-01 00:00:00 11.22
1 2015-01-01 01:00:00 11.32
2 2015-01-01 02:00:00 11.30
3 2015-01-01 03:00:00 11.25
4 2015-01-01 04:00:00 11.32
... ... ...
31339 2018-07-29 19:00:00 17.60
31340 2018-07-29 20:00:00 17.49
31341 2018-07-29 21:00:00 17.44
31342 2018-07-29 22:00:00 17.39
31343 2018-07-29 23:00:00 17.37
I want to convert this dataframe to half-hourly data, and impute each new position with the mean of the previous and following values (or any similar interpolation), for example:
datetime temp
0 2015-01-01 00:00:00 11.00
1 2015-01-01 00:30:00 11.50
2 2015-01-01 01:00:00 12.00
Is there any pandas/datetime function to assist in this operation?
Thank you
You can use the resample() function in Pandas. With it you set the frequency to down/upsample to, and then what to do with the result (mean, sum, etc.). In your case you can also interpolate between the values.
For this to work your datetime column has to be a datetime dtype; then set it as the index.
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime', inplace=True)
Then you can resample to 30 minutes ('30T') and then interpolate.
df.resample('30T').interpolate()
Resulting in...
temp
datetime
2015-01-01 00:00:00 11.220
2015-01-01 00:30:00 11.270
2015-01-01 01:00:00 11.320
2015-01-01 01:30:00 11.310
2015-01-01 02:00:00 11.300
2015-01-01 02:30:00 11.275
2015-01-01 03:00:00 11.250
2015-01-01 03:30:00 11.285
2015-01-01 04:00:00 11.320
Read more about the frequency strings and resampling in the Pandas docs.
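As an aside for recent pandas (2.2+), the 'T' alias is deprecated in favour of 'min', so the same call would be spelled:
# 'min' replaces the deprecated 'T' minute alias
df.resample('30min').interpolate()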

How to groupby week starting at a particular time

I have data that I wish to groupby week.
I have been able to do this using the following
Data_Frame.groupby([pd.Grouper(freq='W')]).count()
this creates a dataframe in the form of
2018-01-07 ...
2018-01-14 ...
2018-01-21 ...
which is great. However, I need it to start at 06:00, so something like
2018-01-07 06:00:00 ...
2018-01-14 06:00:00 ...
2018-01-21 06:00:00 ...
I am aware that I could shift my data by 6 hours, but this seems like a cheat, and I'm pretty sure Grouper comes with the functionality to do this (some way of specifying when it should start grouping).
I was hoping someone might know of a good method of doing this.
Many Thanks
edit:
I'm trying to use Python's actual built-in functionality more, since it often works better and more consistently. I also turn the data into a graph with the timestamps as the y column, and I want the timestamps to actually reflect the data, without some method such as shifting everything by 6 hours, grouping, and then shifting everything back 6 hours to get the right timestamps.
Use double shift:
np.random.seed(456)
idx = pd.date_range(start = '2018-01-07', end = '2018-01-09', freq = '2H')
df = pd.DataFrame({'a':np.random.randint(10, size=25)}, index=idx)
print (df)
a
2018-01-07 00:00:00 5
2018-01-07 02:00:00 9
2018-01-07 04:00:00 4
2018-01-07 06:00:00 5
2018-01-07 08:00:00 7
2018-01-07 10:00:00 1
2018-01-07 12:00:00 8
2018-01-07 14:00:00 3
2018-01-07 16:00:00 5
2018-01-07 18:00:00 2
2018-01-07 20:00:00 4
2018-01-07 22:00:00 2
2018-01-08 00:00:00 2
2018-01-08 02:00:00 8
2018-01-08 04:00:00 4
2018-01-08 06:00:00 8
2018-01-08 08:00:00 5
2018-01-08 10:00:00 6
2018-01-08 12:00:00 0
2018-01-08 14:00:00 9
2018-01-08 16:00:00 8
2018-01-08 18:00:00 2
2018-01-08 20:00:00 3
2018-01-08 22:00:00 6
2018-01-09 00:00:00 7
# freq='D' for an easy check; in the original, use freq='W'
df1 = df.shift(-6, freq='H').groupby([pd.Grouper(freq='D')]).count().shift(6, freq='H')
print (df1)
a
2018-01-06 06:00:00 3
2018-01-07 06:00:00 12
2018-01-08 06:00:00 10
So to solve this, one needs to use the base parameter of Grouper.
However, the caveat is that base is expressed in the same units as the frequency: whatever time period freq uses (years, months, days, etc.), base will be in it too (from what I can tell).
So, since I want to displace the starting position by 6 hours, my freq needs to be in hours rather than weeks (i.e. 1W = 168H).
So the solution I was looking for was
Data_Frame.groupby([pd.Grouper(freq='168H', base = 6)]).count()
This is simple, short, quick and works exactly as I want it to.
Thanks to all the other answers though
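For readers on a newer pandas: base= was deprecated in pandas 1.1 in favour of offset= (and origin=), which pd.Grouper also accepts, so the same grouping should be expressible as:
# offset= replaces the deprecated base= parameter
Data_Frame.groupby([pd.Grouper(freq='168h', offset='6h')]).count()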
I would create another column with the required dates, and groupby on it:
import pandas as pd
import numpy as np
selected_datetime = pd.date_range(start = '2018-01-07', end = '2018-01-30', freq = '1H')
df = pd.DataFrame(selected_datetime, columns = ['date'])
df['value1'] = np.random.rand(df.shape[0])
# specify the condition for your date, eg. starting from 6am
df['shift1'] = df['date'].apply(lambda x: x.date() if x.hour == 6 else np.nan)
# forward-fill the NaNs so each row carries the most recent 6am date
# (rows before the first 6am stay NaN and are dropped by the groupby)
df['shift1'] = df['shift1'].ffill()
# you can groupby on this col
df.groupby('shift1')['value1'].mean()

python: compare two timestamp in different dates

I have a dataframe whose index is a timestamp in 'YYYY-MM-DD HH:MM:SS' format.
Now I want to divide this dataframe into two parts:
one is the data with times before 12pm ('YYYY-MM-DD 12:00:00') each day,
the other is the data with times after 12pm each day.
I've been stuck on this question for several days. Any suggestions?
Thank you.
If you have a DatetimeIndex (and if you don't, df.index = pd.to_datetime(df.index) should work to get one), then you can access .hour, e.g. df.index.hour, and select using that:
>>> df.head()
A
2015-01-01 00:00:00 0
2015-01-01 01:00:00 1
2015-01-01 02:00:00 2
2015-01-01 03:00:00 3
2015-01-01 04:00:00 4
>>> morning = df[df.index.hour < 12]
>>> afternoon = df[df.index.hour >= 12]
>>> morning.head()
A
2015-01-01 00:00:00 0
2015-01-01 01:00:00 1
2015-01-01 02:00:00 2
2015-01-01 03:00:00 3
2015-01-01 04:00:00 4
>>> afternoon.head()
A
2015-01-01 12:00:00 12
2015-01-01 13:00:00 13
2015-01-01 14:00:00 14
2015-01-01 15:00:00 15
2015-01-01 16:00:00 16
You could also use groupby, e.g. df.groupby(df.index.hour < 12), but that seems like overkill here. If you wanted a more complex division that might be the way to go, though.
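Another option is DataFrame.between_time, which selects rows by wall-clock time directly (a sketch, assuming a DatetimeIndex; it reads especially well when the cutoff isn't on an exact hour):
# between_time filters by time of day, ignoring the date
morning = df.between_time('00:00', '12:00', inclusive='left')
afternoon = df.between_time('12:00', '23:59:59')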
