I have a long table with a datetime column and a value column; a short example of the dataframe is shown below. What I currently do is group by hour, weekday, or month and take the mean over all times. This is for the hourly value:
hourly_value = df.groupby([lambda idx: idx.hour]).agg([np.mean, np.std])
datetime value
0 2018-01-01 00:30:00+01:00 0.22
1 2018-01-01 00:35:00+01:00 0.31
2 2018-01-02 00:30:00+01:00 1.15
3 2018-01-02 00:35:00+01:00 1.80
4 2018-01-03 00:30:00+01:00 2.60
5 2018-01-03 00:35:00+01:00 2.30
6 2018-01-04 00:30:00+01:00 1.90
7 2018-01-04 00:35:00+01:00 2.10
8 2018-01-05 00:30:00+01:00 2.90
Now what I want is the hourly value for each day of the week: Monday every hour, Tuesday every hour, Wednesday every hour, ...
Can someone help me with this? :)
You can try (assuming datetime is set as the index):
df.groupby(lambda idx: (idx.hour, idx.strftime("%A"))).agg([np.mean, np.std])
output:
value
mean std
(0, Friday) 2.900 NaN
(0, Monday) 0.265 0.063640
(0, Thursday) 2.000 0.141421
(0, Tuesday) 1.475 0.459619
(0, Wednesday) 2.450 0.212132
Where the index is an (hour, weekday) pair.
But note that, e.g., Mondays from different weeks are grouped into one group.
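A more self-contained variant (a sketch, keeping datetime as a regular column and omitting the timezone offset for brevity) groups on the hour and the weekday name directly, which avoids the lambda on the index:

```python
import pandas as pd

# a few rows taken from the sample above
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2018-01-01 00:30:00", "2018-01-01 00:35:00",
        "2018-01-02 00:30:00", "2018-01-02 00:35:00",
    ]),
    "value": [0.22, 0.31, 1.15, 1.80],
})

# group on (hour, weekday name) derived from the datetime column
out = df.groupby(
    [df["datetime"].dt.hour, df["datetime"].dt.day_name()]
)["value"].agg(["mean", "std"])
print(out)
```

This produces a two-level (hour, weekday) index instead of tuple keys, which is usually easier to select from later.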
Another way of calculating:
df.resample('1D', on='datetime').agg([np.mean, np.std])
outputs:
value
mean std
datetime
2017-12-31 0.265 0.063640
2018-01-01 1.475 0.459619
2018-01-02 2.450 0.212132
2018-01-03 2.000 0.141421
2018-01-04 2.900 NaN
I have a Pandas dataframe containing hourly precipitation data (tp) between 2013 and 2020, the dataframe is called df:
tp
time
2013-01-01 00:00:00 0.1
2013-01-01 01:00:00 0.1
2013-01-01 02:00:00 0.1
2013-01-01 03:00:00 0.0
2013-01-01 04:00:00 0.2
...
2020-12-31 19:00:00 0.2
2020-12-31 20:00:00 0.1
2020-12-31 21:00:00 0.0
2020-12-31 22:00:00 0.1
2020-12-31 23:00:00 0.0
I'm trying to convert this hourly dataset into monthly totals for each year. I then want to take the average of the monthly summed rainfall, so that I end up with a dataframe with 12 rows, one for each month, showing the average summed rainfall over the whole period.
I've tried the resample function:
df.resample('M').mean()
However, this outputs the following, which is not what I'm looking to achieve:
tp1
time
2013-01-31 0.121634
2013-02-28 0.318097
2013-03-31 0.356973
2013-04-30 0.518160
2013-05-31 0.055290
...
2020-09-30 0.132713
2020-10-31 0.070817
2020-11-30 0.060525
2020-12-31 0.040002
2021-01-31 0.000000
[97 rows x 1 columns]
While this converts the hourly data to monthly, I want an average of the rainfall across the years.
e.g.
January Column = Average of January rainfall between 2013 and 2020.
Assuming your index is a DatetimeIndex, you can use:
out = df.groupby(df.index.month).mean()
print(out)
# Output
tp1
time
1 0.498262
2 0.502057
3 0.502644
4 0.496880
5 0.499100
6 0.497931
7 0.504981
8 0.497841
9 0.499646
10 0.499804
11 0.506938
12 0.501172
Setup:
import pandas as pd
import numpy as np
np.random.seed(2022)
dti = pd.date_range('2013-01-31', '2021-01-31', freq='H', name='time')
df = pd.DataFrame({'tp1': np.random.random(len(dti))}, index=dti)
print(df)
# Output
tp1
time
2013-01-31 00:00:00 0.009359
2013-01-31 01:00:00 0.499058
2013-01-31 02:00:00 0.113384
2013-01-31 03:00:00 0.049974
2013-01-31 04:00:00 0.685408
... ...
2021-01-30 20:00:00 0.021295
2021-01-30 21:00:00 0.275759
2021-01-30 22:00:00 0.367263
2021-01-30 23:00:00 0.777680
2021-01-31 00:00:00 0.021225
[70129 rows x 1 columns]
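Note that the groupby above averages the raw hourly values directly. If you literally want the mean of the monthly totals, as the question describes, you can first resample to monthly sums and then group by month. A sketch with synthetic data (the values are random, only the shape matters):

```python
import pandas as pd
import numpy as np

# synthetic hourly rainfall for 2013-2020 (hypothetical values)
rng = np.random.default_rng(0)
dti = pd.date_range("2013-01-01", "2020-12-31 23:00", freq="H", name="time")
df = pd.DataFrame({"tp": rng.random(len(dti))}, index=dti)

# one summed value per calendar month, then average across the years
monthly_totals = df.resample("M").sum()
avg_by_month = monthly_totals.groupby(monthly_totals.index.month).mean()
print(avg_by_month)  # 12 rows: mean of the monthly totals per calendar month
```

The two approaches differ because months have different numbers of hours, so the mean of the monthly sums is not the same as the overall hourly mean scaled up.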
I have a pandas dataframe which looks like this:
Concentr 1 Concentr 2 Time
0 25.4 0.48 00:01:00
1 26.5 0.49 00:02:00
2 25.2 0.52 00:03:00
3 23.7 0.49 00:04:00
4 23.8 0.55 00:05:00
5 24.6 0.53 00:06:00
6 26.3 0.57 00:07:00
7 27.1 0.59 00:08:00
8 28.8 0.56 00:09:00
9 23.9 0.54 00:10:00
10 25.6 0.49 00:11:00
11 27.5 0.56 00:12:00
12 26.3 0.55 00:13:00
13 25.3 0.54 00:14:00
and I want to keep the max value of Concentr 1 in every 5-minute interval, along with the time it occurred and the value of Concentr 2 at that time. So, for the previous example I would like to have:
Concentr 1 Concentr 2 Time
0 26.5 0.49 00:02:00
1 28.8 0.56 00:09:00
2 27.5 0.56 00:12:00
My current approach would be: i) create an auxiliary variable with an ID for each 5-min interval (e.g. 00:00 to 00:05 would be interval 1, 00:05 to 00:10 interval 2, etc.); ii) use the interval variable in a groupby to get the max Concentr 1 per interval; and iii) merge back to the initial df using both the interval variable and Concentr 1, thus identifying the corresponding time.
I would like to ask if there is a better / more efficient / more elegant way to do it.
Thank you very much for any help.
You can do a regular resample / groupby, and use the idxmax method to get the desired row for each group. Then use that to index your original data:
>>> df.loc[df.resample('5T', on='Time')['Concentr 1'].idxmax()]
Concentr 1 Concentr 2 Time
1 26.5 0.49 2021-10-09 00:02:00
8 28.8 0.56 2021-10-09 00:09:00
11 27.5 0.56 2021-10-09 00:12:00
This assumes your 'Time' column is datetime-like, which I did with pd.to_datetime. You can convert the time column back with strftime. So in full:
df['Time'] = pd.to_datetime(df['Time'])
result = df.loc[df.resample('5T', on='Time')['Concentr 1'].idxmax()]
result['Time'] = result['Time'].dt.strftime('%H:%M:%S')
Giving:
Concentr 1 Concentr 2 Time
1 26.5 0.49 00:02:00
8 28.8 0.56 00:09:00
11 27.5 0.56 00:12:00
df = df.set_index('Time')
idx = df.resample('5T')['Concentr 1'].idxmax()
df = df.loc[idx]
Then you would probably need to reset_index() if you do not wish Time to be your index.
You can also use this:
group every n=5 rows and filter the original df based on the index of the max "Concentr 1" in each block:
df = df[df.index.isin(df.groupby(df.index // 5)["Concentr 1"].idxmax())]
print(df)
Output:
Concentr 1 Concentr 2 Time
1 26.5 0.49 00:02:00
8 28.8 0.56 00:09:00
11 27.5 0.56 00:12:00
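A self-contained run of this trick on the sample data (the Time strings are reconstructed with a hypothetical date; only the positional index matters). It relies on the rows being exactly one minute apart with no gaps; with irregular timestamps the resample-based approach is safer:

```python
import pandas as pd

# rebuild the sample frame: one row per minute, default RangeIndex
df = pd.DataFrame({
    "Concentr 1": [25.4, 26.5, 25.2, 23.7, 23.8, 24.6, 26.3,
                   27.1, 28.8, 23.9, 25.6, 27.5, 26.3, 25.3],
    "Concentr 2": [0.48, 0.49, 0.52, 0.49, 0.55, 0.53, 0.57,
                   0.59, 0.56, 0.54, 0.49, 0.56, 0.55, 0.54],
    "Time": pd.date_range("2021-10-09 00:01:00", periods=14,
                          freq="min").strftime("%H:%M:%S"),
})

# integer-divide the positional index by 5 to label each 5-row block,
# then keep only the row holding the block maximum of "Concentr 1"
keep = df.groupby(df.index // 5)["Concentr 1"].idxmax()
out = df[df.index.isin(keep)]
print(out)
```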
I have the following table:
Hora_Retiro count_uses
0 00:00:18 1
1 00:00:34 1
2 00:02:27 1
3 00:03:13 1
4 00:06:45 1
... ... ...
748700 23:58:47 1
748701 23:58:49 1
748702 23:59:11 1
748703 23:59:47 1
748704 23:59:56 1
And I want to group all values within each hour, so I can see the total number of uses per hour (00:00:00 - 23:00:00).
I have the following code:
hora_pico_aug= hora_pico.groupby(pd.Grouper(key="Hora_Retiro",freq='H')).count()
The Hora_Retiro column is of timedelta64[ns] type.
Which gives the following output:
count_uses
Hora_Retiro
00:00:02 2566
01:00:02 602
02:00:02 295
03:00:02 5
04:00:02 10
05:00:02 4002
06:00:02 16075
07:00:02 39410
08:00:02 76272
09:00:02 56721
10:00:02 36036
11:00:02 32011
12:00:02 33725
13:00:02 41032
14:00:02 50747
15:00:02 50338
16:00:02 42347
17:00:02 54674
18:00:02 76056
19:00:02 57958
20:00:02 34286
21:00:02 22509
22:00:02 13894
23:00:02 7134
However, the index column starts at 00:00:02, and I want it to start at 00:00:00 and then go in one-hour intervals. Something like this:
count_uses
Hora_Retiro
00:00:00 2565
01:00:00 603
02:00:00 295
03:00:00 5
04:00:00 10
05:00:00 4002
06:00:00 16075
07:00:00 39410
08:00:00 76272
09:00:00 56721
10:00:00 36036
11:00:00 32011
12:00:00 33725
13:00:00 41032
14:00:00 50747
15:00:00 50338
16:00:00 42347
17:00:00 54674
18:00:00 76056
19:00:00 57958
20:00:00 34286
21:00:00 22509
22:00:00 13894
23:00:00 7134
How can I make it start at 00:00:00?
Thanks for the help!
You can create an hour column from the Hora_Retiro column. Since it is of timedelta64 type, use the components accessor (plain .dt.hour only exists for datetimes):
df['hour'] = df['Hora_Retiro'].dt.components.hours
And then group by hour:
gpby_df = df.groupby('hour')['count_uses'].sum().reset_index()
gpby_df['hour'] = pd.to_datetime(gpby_df['hour'], format='%H').dt.time
gpby_df.columns = ['Hora_Retiro', 'sum_count_uses']
gpby_df
gives
Hora_Retiro sum_count_uses
0 00:00:00 14
1 09:00:00 1
2 10:00:00 2
3 20:00:00 2
I assume that the Hora_Retiro column in your DataFrame is of Timedelta type. It is not datetime, as in that case the date part would also be printed.
Indeed, your code creates groups starting at the minute / second taken from the first row.
To group by "full hours": floor each element in this column to the hour, then group (just by this floored value). The code to do it is:
hora_pico.groupby(hora_pico.Hora_Retiro.dt.floor('H')).count_uses.count()
However, I advise you to decide what you want to count: rows, or values in the count_uses column. In the second case, replace the count function with sum.
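A minimal sketch of the flooring approach on a few hypothetical timedelta values (the counts are made up), summing count_uses per hour bucket:

```python
import pandas as pd

# hypothetical sample: timedelta column plus a count
df = pd.DataFrame({
    "Hora_Retiro": pd.to_timedelta(
        ["00:00:18", "00:59:59", "01:00:02", "23:59:56"]),
    "count_uses": [1, 1, 1, 1],
})

# floor each timedelta to the hour so every bin starts at a full hour
out = df.groupby(df["Hora_Retiro"].dt.floor("H"))["count_uses"].sum()
print(out)
```

Flooring (rather than rounding) is what guarantees that, e.g., 00:59:59 stays in the 00:00:00 bucket instead of spilling into 01:00:00.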
I am fairly new to python and pandas, so I apologise for any future misunderstandings.
I have a pandas DataFrame with hourly values, looking something like this:
2014-04-01 09:00:00 52.9 41.1 36.3
2014-04-01 10:00:00 56.4 41.6 70.8
2014-04-01 11:00:00 53.3 41.2 49.6
2014-04-01 12:00:00 50.4 39.5 36.6
2014-04-01 13:00:00 51.1 39.2 33.3
2016-11-30 16:00:00 16.0 13.5 36.6
2016-11-30 17:00:00 19.6 17.4 44.3
Now I need to calculate 24h average values for each column starting from 2014-04-01 12:00 to 2014-04-02 11:00
So I want daily averages from noon to noon.
Unfortunately, I have no idea how to do that. I have read some suggestions to use groupby, but I don't really know how...
Thank you very much in advance! Any help is appreciated!!
For newer versions of pandas (>= 1.1.0) use the offset argument:
df.resample('24H', offset='12H').mean()
For older pandas versions, use the base argument. A day is 24 hours, so base=12 starts the grouping at noon, i.e. noon-to-noon bins. Resample gives you all days in between, so you could .dropna(how='all') if you don't need the complete basis. (I assume you have a DatetimeIndex; if not, you can use the on argument of resample to specify your datetime column.)
df.resample('24H', base=12).mean()
#df.groupby(pd.Grouper(level=0, base=12, freq='24H')).mean() # Equivalent
1 2 3
0
2014-03-31 12:00:00 54.20 41.30 52.233333
2014-04-01 12:00:00 50.75 39.35 34.950000
2014-04-02 12:00:00 NaN NaN NaN
2014-04-03 12:00:00 NaN NaN NaN
2014-04-04 12:00:00 NaN NaN NaN
... ... ... ...
2016-11-26 12:00:00 NaN NaN NaN
2016-11-27 12:00:00 NaN NaN NaN
2016-11-28 12:00:00 NaN NaN NaN
2016-11-29 12:00:00 NaN NaN NaN
2016-11-30 12:00:00 17.80 15.45 40.450000
You could subtract 12 hours from the index and group by the normalized (midnight-truncated) result:
df.groupby((df.index - pd.to_timedelta('12:00:00')).normalize()).mean()
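A quick check of this one-liner on a few rows from the question (the column name "a" is just a placeholder):

```python
import pandas as pd

idx = pd.to_datetime([
    "2014-04-01 09:00", "2014-04-01 10:00", "2014-04-01 11:00",
    "2014-04-01 12:00", "2014-04-01 13:00",
])
df = pd.DataFrame({"a": [52.9, 56.4, 53.3, 50.4, 51.1]}, index=idx)

# shift back 12h, then normalize() drops the time part, so each
# noon-to-noon window collapses onto a single calendar day
out = df.groupby(
    (df.index - pd.to_timedelta("12:00:00")).normalize()
).mean()
print(out)
```

The 09:00-11:00 rows land on 2014-03-31 (the window that started at noon the previous day), while 12:00 and 13:00 start the 2014-04-01 window, matching the resample output above.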
You can shift the timestamps by 12 hours and resample at day level.
from io import StringIO
import pandas as pd
data = """
2014-04-01 09:00:00,52.9,41.1,36.3
2014-04-01 10:00:00,56.4,41.6,70.8
2014-04-01 11:00:00,53.3,41.2,49.6
2014-04-01 12:00:00,50.4,39.5,36.6
2014-04-01 13:00:00,51.1,39.2,33.3
2016-11-30 16:00:00,16.0,13.5,36.6
2016-11-30 17:00:00,19.6,17.4,44.3
"""
df = pd.read_csv(StringIO(data), sep=',', header=None, index_col=0)
df.index = pd.to_datetime(df.index)
# shift by 12 hours
df.index = df.index - pd.Timedelta(hours=12)
# resample and drop na rows
df.resample('D').mean().dropna()
One year's data is shown below:
datetime data
2008-01-01 00:00:00 0.044
2008-01-01 00:30:00 0.031
2008-01-01 01:00:00 -0.25
.....
2008-01-31 23:00:00 0.036
2008-01-31 23:30:00 0.42
2008-01-02 00:00:00 0.078
2008-01-02 00:30:00 0.008
2008-01-02 01:00:00 0.09
2008-01-02 01:30:00 0.054
.....
2008-12-31 22:00:00 0.55
2008-12-31 22:30:00 0.05
2008-12-31 23:00:00 0.08
2008-12-31 23:30:00 0.033
There is one value per half-hour. I want the sum of all values in each day, converting the data to 365 daily rows.
year day sum_values
2008 1 *
2008 2 *
...
2008 364 *
2008 365 *
You can use dt.year and dt.dayofyear with groupby and aggregate with sum:
df = df.groupby([df['datetime'].dt.year, df['datetime'].dt.dayofyear]).sum()
print (df)
data
datetime datetime
2008 1 -0.175
2 0.230
31 0.456
366 0.713
And if you need a DataFrame, you can convert the index to columns and set the column names with rename_axis + reset_index:
df = (df.groupby([df['datetime'].dt.year, df['datetime'].dt.dayofyear])['data']
        .sum()
        .rename_axis(('year','dayofyear'))
        .reset_index())
print (df)
year dayofyear data
0 2008 1 -0.175
1 2008 2 0.230
2 2008 31 0.456
3 2008 366 0.713
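A self-contained run of the approach above on the first few rows of the sample data:

```python
import pandas as pd

# a handful of rows from the question
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2008-01-01 00:00:00", "2008-01-01 00:30:00", "2008-01-01 01:00:00",
        "2008-01-02 00:00:00", "2008-01-02 00:30:00",
    ]),
    "data": [0.044, 0.031, -0.25, 0.078, 0.008],
})

# sum per (year, day-of-year), then flatten the index back into columns
out = (df.groupby([df["datetime"].dt.year, df["datetime"].dt.dayofyear])["data"]
         .sum()
         .rename_axis(("year", "dayofyear"))
         .reset_index())
print(out)
```

Day 1 sums to 0.044 + 0.031 - 0.25 = -0.175, matching the output shown above.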