Pandas average values for the same hour for each day - python

The post "Pandas Dataframe, getting average value for each monday 1 Am" describes a problem similar to mine, so I borrowed its dataframe. However, my case is a little harder and that solution didn't really work for me.
time value
2016-03-21 00:00:00 0.613014
2016-03-21 01:00:00 0.596383
2016-03-21 02:00:00 0.623570
2016-03-21 03:00:00 0.663350
2016-03-21 04:00:00 0.677817
2016-03-21 05:00:00 0.727116
2016-03-21 06:00:00 0.920279
2016-03-21 07:00:00 1.205863
2016-03-21 08:00:00 0.880946
2016-03-21 09:00:00 0.186947
2016-03-21 10:00:00 -0.563276
2016-03-21 11:00:00 -1.249595
2016-03-21 12:00:00 -1.596035
2016-03-21 13:00:00 -1.886954
2016-03-21 14:00:00 -1.912325
2016-03-21 15:00:00 -1.750623
...
2016-06-20 23:00:00 2.125791
What I need is the average value across all days at a specific time of day. Say, I have to get the average over every day at 1AM, then at 2AM, then at 3AM.
I would like to do this with groupby, which might make the rolling average I have to do afterwards easier, but every method counts, thanks!

You can extract the hours in a separate column and groupby() by it:
df['hour'] = df.time.dt.hour
result_df = df.groupby(['hour']).mean()

You could create a temporary column that holds hours and group on that, eg:
df['hour'] = df['time'].dt.hour
and then
hour_avg = df.groupby(['hour']).mean()
Note that this approach aggregates each hour over all years, months and days.
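As a minimal sketch of the answers above, assuming the time column is already datetime64 (for example after pd.to_datetime) and using a made-up three-day frame in place of the real data, the grouping can also be done without the temporary column:

```python
import pandas as pd

# Made-up stand-in for the real data: 3 days of hourly values.
idx = pd.date_range("2016-03-21", periods=72, freq="h")
df = pd.DataFrame({"time": idx, "value": [float(i) for i in range(72)]})

# Group on the hour of day directly; each group holds one row per day.
hour_avg = df.groupby(df["time"].dt.hour)["value"].mean()

print(hour_avg.head())
```

The result is a Series indexed by hour of day (0 to 23), so the average for 1AM is hour_avg.loc[1].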

Related

Remove the first day of a dataframe

I have the following dataframe:
X
Datetime
2017-01-01 01:00:00 3129.3460
2017-01-01 02:00:00 5433.4315
2017-01-01 03:00:00 2351.8391
2017-01-01 04:00:00 6788.3210
2017-01-01 05:00:00 1232.8655
...
2022-08-14 20:00:00 8905.5340
2022-08-14 21:00:00 8623.0765
2022-08-14 22:00:00 9054.8312
2022-08-14 23:00:00 10341.4785
2022-08-15 00:00:00 9341.1234
How can I remove a whole day of data if the first hour of that day is not zero? In this case, I need to remove the whole of 2017-01-01.
So I thought of using an if condition with df.drop():
first_data = df.index.min()
if first_data.hour != 0:
    df = df.drop(first_data)
But I am not sure what I should pass as the argument to df.drop. In the above code, it will only drop the first hour of that first day, since first_data gives me the whole timestamp, from year down to seconds.
Let's try groupby and filter:
out = (df.groupby(df.index.date)
         .filter(lambda g: str(g.index.min().time()) == '00:00:00'))
print(out)
X
2022-08-15 00:00:00 9341.1234
2022-08-15 01:00:00 9341.1234
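A vectorised variant of the same idea, sketched on a made-up three-day frame (the index range and values are assumptions, not the asker's data): compare each day's earliest timestamp with that day's midnight and keep only the days where the two match.

```python
import pandas as pd

# Made-up data: the first day starts at 01:00, the other days at 00:00.
idx = pd.date_range("2017-01-01 01:00", "2017-01-03 23:00", freq="h")
df = pd.DataFrame({"X": range(len(idx))}, index=idx)

# normalize() floors every timestamp to midnight of its own day.
days = df.index.normalize()

# Earliest timestamp of each day, broadcast back to every row of that day.
first_per_day = df.index.to_series().groupby(days).transform("min")

# Keep a row only if its day actually starts at midnight.
keep = first_per_day.to_numpy() == days.to_numpy()
out = df[keep]

print(out.index.min())
```

Unlike the if/drop attempt in the question, this checks every day in the frame, not only the first one.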

How to use pandas Grouper to get sum of values within each hour

I have the following table:
Hora_Retiro count_uses
0 00:00:18 1
1 00:00:34 1
2 00:02:27 1
3 00:03:13 1
4 00:06:45 1
... ... ...
748700 23:58:47 1
748701 23:58:49 1
748702 23:59:11 1
748703 23:59:47 1
748704 23:59:56 1
And I want to group all values within each hour, so I can see the total number of uses per hour (00:00:00 - 23:00:00)
I have the following code:
hora_pico_aug= hora_pico.groupby(pd.Grouper(key="Hora_Retiro",freq='H')).count()
Hora_Retiro column is of timedelta64[ns] type
Which gives the following output:
count_uses
Hora_Retiro
00:00:02 2566
01:00:02 602
02:00:02 295
03:00:02 5
04:00:02 10
05:00:02 4002
06:00:02 16075
07:00:02 39410
08:00:02 76272
09:00:02 56721
10:00:02 36036
11:00:02 32011
12:00:02 33725
13:00:02 41032
14:00:02 50747
15:00:02 50338
16:00:02 42347
17:00:02 54674
18:00:02 76056
19:00:02 57958
20:00:02 34286
21:00:02 22509
22:00:02 13894
23:00:02 7134
However, the index starts at 00:00:02, and I want it to start at 00:00:00 and then proceed in one-hour intervals. Something like this:
count_uses
Hora_Retiro
00:00:00 2565
01:00:00 603
02:00:00 295
03:00:00 5
04:00:00 10
05:00:00 4002
06:00:00 16075
07:00:00 39410
08:00:00 76272
09:00:00 56721
10:00:00 36036
11:00:00 32011
12:00:00 33725
13:00:00 41032
14:00:00 50747
15:00:00 50338
16:00:00 42347
17:00:00 54674
18:00:00 76056
19:00:00 57958
20:00:00 34286
21:00:00 22509
22:00:00 13894
23:00:00 7134
How can I make it start at 00:00:00?
Thanks for the help!
You can create an hour column from the Hora_Retiro column. For a datetime column that is .dt.hour; a timedelta column has no .dt.hour, so use .dt.components.hours instead:
df['hour'] = df['Hora_Retiro'].dt.components.hours
And then group by hour:
gpby_df = df.groupby('hour')['count_uses'].sum().reset_index()
gpby_df['hour'] = pd.to_datetime(gpby_df['hour'], format='%H').dt.time
gpby_df.columns = ['Hora_Retiro', 'sum_count_uses']
gpby_df
which gives
Hora_Retiro sum_count_uses
0 00:00:00 14
1 09:00:00 1
2 10:00:00 2
3 20:00:00 2
I assume that the Hora_Retiro column in your DataFrame is of Timedelta type. It is not datetime, as in that case the date part would also be printed.
Indeed, your code creates groups starting at the minute / second taken from the first row.
To group by "full hours":
floor each element in this column to the hour (rounding instead would push times past the half hour into the next hour),
then group (just by this floored value).
The code to do it is:
hora_pico.groupby(hora_pico.Hora_Retiro.dt.floor('h')).count_uses.count()
However, I advise you to decide what you want to count: rows, or the values in the count_uses column.
In the second case replace the count function with sum.
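A minimal sketch of grouping a timedelta column by full hours, using a few made-up rows (the values are assumptions chosen for illustration). It floors each timedelta to the hour, so 00:58:47 stays in the 00:00:00 bin:

```python
import pandas as pd

# Made-up rows standing in for the real table.
hora_pico = pd.DataFrame({
    "Hora_Retiro": pd.to_timedelta(
        ["00:00:18", "00:58:47", "01:00:02", "01:30:00", "23:59:56"]
    ),
    "count_uses": [1, 1, 1, 1, 1],
})

# Floor every timedelta to the start of its hour: 00:58:47 -> 00:00:00.
hour_bin = hora_pico["Hora_Retiro"].dt.floor("h")

# Total uses per full hour; swap sum() for count() to count rows instead.
per_hour = hora_pico.groupby(hour_bin)["count_uses"].sum()

print(per_hour)
```

Hours with no rows simply do not appear in the result; reindexing against a full range of 24 hourly timedeltas would be one way to add them with a count of zero.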

Deleting rows between multiple sets of time stamps

I have a DataFrame that has time stamps in the form of (yyyy-mm-dd hh:mm:ss). I'm trying to delete data between two different time stamps. At the moment I can delete the data between 1 range of time stamps but I have trouble extending this to multiple time stamps.
For example, with the DataFrame I can delete a range of rows (e.g. 2015-03-01 00:20:00 to 2015-08-01 01:10:00) however, I'm not sure how to go about deleting another range alongside it. The code that does that is shown below.
index_list = df.index[(df.timestamp >= "2015-07-01 00:00:00") & (df.timestamp <= "2015-12-30 23:50:00")]
df.drop(index_list, inplace=True)
The DataFrame extends over 3 years and has every day in the 3 years included.
I'm trying to delete all the rows from months July to December (2015-07-01 00:00:00 to 2015-12-30 23:50:00) for all 3 years.
I was thinking that I create a helper column that gets the Month from the Date column and then drops based off the Month from the helper column.
I would greatly appreciate any advice. Thanks!
Edit:
I've added a small summarised version of the DataFrame. This is what the initial DataFrame looks like.
df Date v
2015-01-01 00:00:00 30.0
2015-02-01 00:10:00 55.0
2015-03-01 00:20:00 36.0
2015-04-01 00:30:00 65.0
2015-05-01 00:40:00 35.0
2015-06-01 00:50:00 22.0
2015-07-01 01:00:00 74.0
2015-08-01 01:10:00 54.0
2015-09-01 01:20:00 86.0
2015-10-01 01:30:00 91.0
2015-11-01 01:40:00 65.0
2015-12-01 01:50:00 35.0
To get something like this
df Date v
2015-01-01 00:00:00 30.0
2015-02-01 00:10:00 55.0
2015-03-01 00:20:00 36.0
2015-05-01 00:40:00 35.0
2015-06-01 00:50:00 22.0
2015-11-01 01:40:00 65.0
2015-12-01 01:50:00 35.0
Where time stamps "2015-07-01 00:20:00 to 2015-10-01 00:30:00" and "2015-07-01 01:00:00 to 2015-10-01 01:30:00" are removed. Sorry if my formatting isn't up to standard.
If your timestamp column uses the correct dtype, you can just do:
df.loc[df.timestamp.dt.month.isin([1, 2, 3, 5, 6, 11, 12])]
This should filter out the months not inside the list.
As you hinted, data manipulation is always easier when you use the right data types. To support time stamps, pandas has the Timestamp type. You can convert the column as follows:
df['Date'] = pd.to_datetime(df['Date']) # No date format needs to be specified,
# "YYYY-MM-DD HH:MM:SS" is the standard
Then, removing all entries in the months of July to December for all years is straightforward:
df = df[df['Date'].dt.month < 7] # Keep only months less than July
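When the ranges to delete do not line up with whole months, one option is to combine several ranges into a single boolean mask. The sketch below reuses the sample rows from the question; the two ranges are assumptions chosen to reproduce the desired output shown there, not values the asker gave.

```python
import pandas as pd

# Sample rows from the question.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2015-01-01 00:00:00", "2015-02-01 00:10:00", "2015-03-01 00:20:00",
        "2015-04-01 00:30:00", "2015-05-01 00:40:00", "2015-06-01 00:50:00",
        "2015-07-01 01:00:00", "2015-08-01 01:10:00", "2015-09-01 01:20:00",
        "2015-10-01 01:30:00", "2015-11-01 01:40:00", "2015-12-01 01:50:00",
    ]),
    "v": [30.0, 55.0, 36.0, 65.0, 35.0, 22.0, 74.0, 54.0, 86.0, 91.0, 65.0, 35.0],
})

# Hypothetical ranges to delete (inclusive on both ends).
ranges = [
    (pd.Timestamp("2015-04-01"), pd.Timestamp("2015-04-30 23:59:59")),
    (pd.Timestamp("2015-07-01"), pd.Timestamp("2015-10-31 23:59:59")),
]

# Start with "drop nothing" and OR in each range.
drop = pd.Series(False, index=df.index)
for start, end in ranges:
    drop |= df["Date"].between(start, end)

result = df[~drop]
print(result)
```

Adding a third or fourth range only means appending another tuple to ranges.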

Looping through dates in python

I have previously only worked in Stata but am now trying to switch to Python. I want to conduct an event study. More specifically, I have 4 fixed dates a year: the first day of every quarter, e.g. 1st January, 1st April, and so on, with an event window of +- 10 days around each event date. To restrict my sample to the desired window I am using the following command:
smpl = merged.loc[datetime.date(year=2013,month=12,day=21):datetime.date(year=2014,month=1,day=10)]
I want to write a loop that automatically shifts the chosen sample period 90 days forward on every iteration, so that I can subsequently run the required analysis in that step. I know how to run the analysis, but I do not know how to shift the sample 90 days forward for every step in the loop. For example, the next sample in the loop should be:
smpl = merged.loc[datetime.date(year=2014,month=3,day=21):datetime.date(year=2014,month=4,day=10)]
It's probably pretty simple, something like month=i and then shifting by +3 months each time. I am just too much of a noob in Python to get the syntax right.
Any help is greatly appreciated.
I'd use this:
for beg in pd.date_range('2013-12-21', '2017-05-17', freq='90D'):
    smpl = merged.loc[beg:beg + pd.Timedelta('20D')]
    ...
Demo:
In [158]: for beg in pd.date_range('2013-12-21', '2017-05-17', freq='90D'):
...: print(beg, beg + pd.Timedelta('20D'))
...:
2013-12-21 00:00:00 2014-01-10 00:00:00
2014-03-21 00:00:00 2014-04-10 00:00:00
2014-06-19 00:00:00 2014-07-09 00:00:00
2014-09-17 00:00:00 2014-10-07 00:00:00
2014-12-16 00:00:00 2015-01-05 00:00:00
2015-03-16 00:00:00 2015-04-05 00:00:00
2015-06-14 00:00:00 2015-07-04 00:00:00
2015-09-12 00:00:00 2015-10-02 00:00:00
2015-12-11 00:00:00 2015-12-31 00:00:00
2016-03-10 00:00:00 2016-03-30 00:00:00
2016-06-08 00:00:00 2016-06-28 00:00:00
2016-09-06 00:00:00 2016-09-26 00:00:00
2016-12-05 00:00:00 2016-12-25 00:00:00
2017-03-05 00:00:00 2017-03-25 00:00:00
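As the demo output shows, a fixed 90-day step slowly drifts away from the quarter starts (the third window already begins on 2014-06-19). If the windows should stay centred on the first day of each quarter, as the question describes, one possible sketch is to generate the event dates with a quarter-start frequency and build the window around each one. The merged frame and its ret column below are made-up stand-ins for the asker's data, and the symmetric 10-day window is an assumption.

```python
import pandas as pd

# Hypothetical daily data standing in for the asker's `merged` frame.
days = pd.date_range("2013-12-01", "2015-02-01", freq="D")
merged = pd.DataFrame({"ret": range(len(days))}, index=days)

# Event dates: first day of every quarter (Jan, Apr, Jul, Oct).
events = pd.date_range("2014-01-01", "2015-01-01", freq="QS-JAN")
half_window = pd.Timedelta(days=10)

samples = {}
for event in events:
    # .loc slicing on a DatetimeIndex includes both end points.
    samples[event] = merged.loc[event - half_window:event + half_window]

for event, smpl in samples.items():
    print(event.date(), smpl.index.min().date(), smpl.index.max().date(), len(smpl))
```

Each sample then covers 21 calendar days, and the per-event analysis can run inside the loop in place of storing the slices.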

Change starting and ending hour of pandas timestamp

I am dealing with a dataset where observations occur between opening and closing hours -- but the service closes on the day after it opens. For example, opening occurs at 7am and closing at 1am, the following day.
This feels like a very common problem -- I've searched around for it and am open to the fact I might just not know the correct terms to search for.
For most of my uses it's enough to do something like:
open_close = pd.date_range(start='2012-01-01 05:00:00', periods=15, freq='D')
Then I can just do fun little groupbys on the df:
df.groupby(open_close.asof).agg(func).
But I've run into an instance where I need to grab several of these open-close periods. What I really want is a DatetimeIndex where I get to pick when a day starts, so I could just redefine 'day' to run from 5AM to 5AM. The nice thing about this is that I could then use things like df[df.index.dayofweek == 6] and get back everything from 5AM on Sunday to 5AM on Monday.
It feels like Periods...or something inside of pandas anticipated this request. Would love help figuring it out.
EDIT:
I've also figured this out via creating another column with the right day
df['shift_day'] = df['datetime'].apply(magicFunctionToFigureOutOpenClose)
-- so this isn't blocking my progress. Just feels like something that could be nicely integrated into the package (or datetime...or somewhere...)
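One way such a helper column could be computed, as a sketch only: subtract the opening hour from each timestamp and keep the date part. The 5AM day start and the sample timestamps are assumptions for illustration, and shift_dow is a hypothetical extra column.

```python
import pandas as pd

# Made-up observations around the 5AM boundary.
df = pd.DataFrame({"datetime": pd.to_datetime([
    "2012-01-01 04:59", "2012-01-01 05:00",
    "2012-01-02 01:00", "2012-01-02 05:00",
])})

day_start = pd.Timedelta(hours=5)  # assumed start of the "service day"

# Shift back by the opening hour, then drop the time of day.
df["shift_day"] = (df["datetime"] - day_start).dt.normalize()

# Day of week of the shifted day (Monday=0 ... Sunday=6).
df["shift_dow"] = df["shift_day"].dt.dayofweek

print(df)
```

With this, 01:00 on Monday 2012-01-02 is still assigned to Sunday 2012-01-01, so a filter such as df[df["shift_dow"] == 6] returns everything from 5AM Sunday up to 5AM Monday.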
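Perhaps the base parameter of df.resample() would help (in pandas 1.1 and later it is deprecated in favour of offset and origin, and it was removed in 2.0):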
base : int, default 0
For frequencies that evenly subdivide 1 day, the "origin" of the
aggregated intervals. For example, for '5min' frequency, base could
range from 0 through 4. Defaults to 0
Here's an example:
In [44]: df = pd.DataFrame(np.random.rand(28),
....: index=pd.DatetimeIndex(start='2012/9/1', periods=28, freq='H'))
In [45]: df
Out[45]:
0
2012-09-01 00:00:00 0.970273
2012-09-01 01:00:00 0.730171
2012-09-01 02:00:00 0.508588
2012-09-01 03:00:00 0.535351
2012-09-01 04:00:00 0.940255
2012-09-01 05:00:00 0.143483
2012-09-01 06:00:00 0.792659
2012-09-01 07:00:00 0.231413
2012-09-01 08:00:00 0.071676
2012-09-01 09:00:00 0.995202
2012-09-01 10:00:00 0.236551
2012-09-01 11:00:00 0.904853
2012-09-01 12:00:00 0.652873
2012-09-01 13:00:00 0.488400
2012-09-01 14:00:00 0.396647
2012-09-01 15:00:00 0.967261
2012-09-01 16:00:00 0.554188
2012-09-01 17:00:00 0.884086
2012-09-01 18:00:00 0.418577
2012-09-01 19:00:00 0.189584
2012-09-01 20:00:00 0.577041
2012-09-01 21:00:00 0.100332
2012-09-01 22:00:00 0.294672
2012-09-01 23:00:00 0.925425
2012-09-02 00:00:00 0.630807
2012-09-02 01:00:00 0.400261
2012-09-02 02:00:00 0.156469
2012-09-02 03:00:00 0.658608
In [46]: df.resample("24H", how=sum, label='left', closed='left', base=5)
Out[46]:
0
2012-08-31 05:00:00 3.684638
2012-09-01 05:00:00 11.671068
In [47]: df.ix[:5].sum()
Out[47]: 0 3.684638
In [48]: df.ix[5:].sum()
Out[48]: 0 11.671068
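The session above uses an older pandas API (how=, base= and .ix are no longer available in current releases). A rough modern equivalent, assuming pandas 1.1 or later, passes offset to resample and calls the aggregation as a method. The data here is a deterministic stand-in (0.0 to 27.0) for the random values above, so the sums differ from that output.

```python
import numpy as np
import pandas as pd

# 28 hourly observations starting at midnight, with values 0.0 .. 27.0.
idx = pd.date_range("2012-09-01", periods=28, freq="h")
df = pd.DataFrame({"value": np.arange(28.0)}, index=idx)

# 24-hour bins whose edges fall at 05:00 instead of midnight.
daily = df["value"].resample("24h", offset="5h").sum()

print(daily)
```

The first bin is labelled 2012-08-31 05:00:00 and holds the five rows before 5AM; the second is labelled 2012-09-01 05:00:00 and holds the rest, mirroring the two rows in the original output.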
