How to group sum along timestamp in pandas? - python

One year's worth of data looks as follows:
datetime data
2008-01-01 00:00:00 0.044
2008-01-01 00:30:00 0.031
2008-01-01 01:00:00 -0.25
.....
2008-01-31 23:00:00 0.036
2008-01-31 23:30:00 0.42
2008-01-02 00:00:00 0.078
2008-01-02 00:30:00 0.008
2008-01-02 01:00:00 0.09
2008-01-02 01:30:00 0.054
.....
2008-12-31 22:00:00 0.55
2008-12-31 22:30:00 0.05
2008-12-31 23:00:00 0.08
2008-12-31 23:30:00 0.033
There is one value per half-hour. I want the sum of all values for each day, i.e. to convert the data to 365 rows of daily sums:
year day sum values
2008 1 *
2008 2 *
...
2008 364 *
2008 365 *

You can use dt.year and dt.dayofyear with groupby and aggregate with sum:
df = df.groupby([df['datetime'].dt.year, df['datetime'].dt.dayofyear]).sum()
print (df)
data
datetime datetime
2008 1 -0.175
2 0.230
31 0.456
366 0.713
And if you need a DataFrame, you can convert the index levels to columns and set the column names with rename_axis + reset_index:
df = (df.groupby([df['datetime'].dt.year, df['datetime'].dt.dayofyear])['data']
        .sum()
        .rename_axis(('year','dayofyear'))
        .reset_index())
print (df)
year dayofyear data
0 2008 1 -0.175
1 2008 2 0.230
2 2008 31 0.456
3 2008 366 0.713
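If the exact year/day-of-year layout is not required, a resample-based sketch (with a made-up toy series, assuming a 'datetime' column of datetime64 dtype) gives the same per-day sums keyed by the actual dates:

```python
import numpy as np
import pandas as pd

# Toy half-hourly series covering 2008 (a leap year, hence 366 days)
idx = pd.date_range('2008-01-01', '2008-12-31 23:30', freq='30min')
df = pd.DataFrame({'datetime': idx, 'data': np.random.randn(len(idx))})

# One row per calendar day; dayofyear can be recovered from the index
daily = df.resample('D', on='datetime')['data'].sum()
```

From there, daily.index.dayofyear gives the day numbers if the year/day layout from the question is still needed.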

Related

Read DataFrame from a text file where datestamp might have float numbers

I have two types of text files that I need to read into a pandas DataFrame. I have a problem with the datetimes and the separator.
File A:
2009,7,1,3,101,13.03,89.33,0.6,287.69,0
2009,7,1,6,102,19.3,55,1,288.67,0
2009,7,1,9,103,22.33,39.67,1,289.6,0
2009,7,1,12,104,21.97,41,1,295.68,0
File B:
2019 9 1 3.00 101 14.02 92.08 2.62 174.77 0.109
2019 9 1 6.00 102 13.79 92.86 2.79 179.29 0.046
2019 9 1 9.00 103 13.81 92.60 2.73 178.94 0.070
2019 9 1 12.00 104 13.31 95.20 2.91 179.38 0.015
fileA.txt has no extra spaces; fileB.txt has one extra space at the beginning of each line. I can read each of these as follows, and the results are correct:
>>> import pandas as pd
>>> from datetime import datetime as dtdt
>>> par3 = lambda x: dtdt.strptime(x, '%Y %m %d %H')
>>> par4 = lambda x: dtdt.strptime(x, '%Y %m %d %H.%M')
>>> df3=pd.read_csv('fileA.txt',header=None,parse_dates={'Date': [0,1,2,3]}, date_parser=par3, index_col='Date')
>>> df3
4 5 6 7 8 9
Date
2009-07-01 03:00:00 101 13.03 89.33 0.6 287.69 0
2009-07-01 06:00:00 102 19.30 55.00 1.0 288.67 0
2009-07-01 09:00:00 103 22.33 39.67 1.0 289.60 0
2009-07-01 12:00:00 104 21.97 41.00 1.0 295.68 0
>>> dg3=pd.read_csv('fileB.txt',sep=r'\s+',engine='python',header=None,parse_dates={'Date': [0,1,2,3]}, date_parser=par4, index_col='Date')
>>> dg3
4 5 6 7 8 9
Date
2019-09-01 03:00:00 101 14.02 92.08 2.62 174.77 0.109
2019-09-01 06:00:00 102 13.79 92.86 2.79 179.29 0.046
2019-09-01 09:00:00 103 13.81 92.60 2.73 178.94 0.070
2019-09-01 12:00:00 104 13.31 95.20 2.91 179.38 0.015
Question: how do I read both of these file types with the same command? The only way I can think of is to first open the file and read the first line to deduce the hour format (column 3) and the separator, but that feels like a non-pythonic way.
Also, if the hour reading is a float such as 3.75, it would be fine to round it to the nearest integer and set the minute reading to 0.
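One possible sketch, not a polished answer: strip each line (so file B's leading space does not create an empty first field), split on either commas or whitespace, and assemble the datetime from the four numeric columns with pd.to_datetime, rounding float hours to the nearest integer as suggested. The read_any helper name is made up for illustration:

```python
import io
import pandas as pd

def read_any(path):
    # Strip leading/trailing whitespace so file B's leading space
    # does not produce an empty first field when splitting
    with open(path) as f:
        text = '\n'.join(line.strip() for line in f)
    # Split on commas or runs of whitespace (handles both file layouts)
    df = pd.read_csv(io.StringIO(text), sep=r'[,\s]+',
                     engine='python', header=None)
    # Round a float hour like 3.75 to the nearest integer; minutes stay 0
    df['Date'] = pd.to_datetime(dict(year=df[0], month=df[1], day=df[2],
                                     hour=df[3].round().astype(int)))
    return df.drop(columns=[0, 1, 2, 3]).set_index('Date')
```

The remaining columns keep their integer labels (4 through 9), matching the outputs shown above.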

Transform an hourly dataframe into a monthly totalled dataframe in Python

I have a Pandas dataframe containing hourly precipitation data (tp) between 2013 and 2020, the dataframe is called df:
tp
time
2013-01-01 00:00:00 0.1
2013-01-01 01:00:00 0.1
2013-01-01 02:00:00 0.1
2013-01-01 03:00:00 0.0
2013-01-01 04:00:00 0.2
...
2020-12-31 19:00:00 0.2
2020-12-31 20:00:00 0.1
2020-12-31 21:00:00 0.0
2020-12-31 22:00:00 0.1
2020-12-31 23:00:00 0.0
I'm trying to convert this hourly dataset into monthly totals for each year. I then want to take the average of the monthly summed rainfall, so that I end up with a dataframe of 12 rows, one per month, showing the average summed rainfall over the whole period.
I've tried the resample function:
df.resample('M').mean()
However, this outputs the following, which is not what I'm looking to achieve:
tp1
time
2013-01-31 0.121634
2013-02-28 0.318097
2013-03-31 0.356973
2013-04-30 0.518160
2013-05-31 0.055290
...
2020-09-30 0.132713
2020-10-31 0.070817
2020-11-30 0.060525
2020-12-31 0.040002
2021-01-31 0.000000
[97 rows x 1 columns]
While this converts the hourly data to monthly values, I want to show an average of the rainfall across the years, e.g.:
January Column = Average of January rainfall between 2013 and 2020.
Assuming your index is a DatetimeIndex, you can use:
out = df.groupby(df.index.month).mean()
print(out)
# Output
tp1
time
1 0.498262
2 0.502057
3 0.502644
4 0.496880
5 0.499100
6 0.497931
7 0.504981
8 0.497841
9 0.499646
10 0.499804
11 0.506938
12 0.501172
Setup:
import pandas as pd
import numpy as np
np.random.seed(2022)
dti = pd.date_range('2013-01-31', '2021-01-31', freq='H', name='time')
df = pd.DataFrame({'tp1': np.random.random(len(dti))}, index=dti)
print(df)
# Output
tp1
time
2013-01-31 00:00:00 0.009359
2013-01-31 01:00:00 0.499058
2013-01-31 02:00:00 0.113384
2013-01-31 03:00:00 0.049974
2013-01-31 04:00:00 0.685408
... ...
2021-01-30 20:00:00 0.021295
2021-01-30 21:00:00 0.275759
2021-01-30 22:00:00 0.367263
2021-01-30 23:00:00 0.777680
2021-01-31 00:00:00 0.021225
[70129 rows x 1 columns]
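Note that groupby(df.index.month).mean() averages the raw hourly values inside each month. If the goal is literally "monthly totals first, then the average of those totals across years", a two-step sketch (again with random placeholder data) might be:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
dti = pd.date_range('2013-01-01', '2020-12-31 23:00', freq='h', name='time')
df = pd.DataFrame({'tp': np.random.random(len(dti))}, index=dti)

# 1) total precipitation per (year, month)
monthly_totals = df.groupby([df.index.year, df.index.month])['tp'].sum()
# 2) average those totals across years -> 12 rows, one per month
avg_by_month = monthly_totals.groupby(level=1).mean()
```

With eight full years of data, monthly_totals has 96 rows and avg_by_month has 12.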

Getting the max value and the time the max value occurs for all periods in a pandas df

I have a pandas dataframe which looks like this:
Concentr 1 Concentr 2 Time
0 25.4 0.48 00:01:00
1 26.5 0.49 00:02:00
2 25.2 0.52 00:03:00
3 23.7 0.49 00:04:00
4 23.8 0.55 00:05:00
5 24.6 0.53 00:06:00
6 26.3 0.57 00:07:00
7 27.1 0.59 00:08:00
8 28.8 0.56 00:09:00
9 23.9 0.54 00:10:00
10 25.6 0.49 00:11:00
11 27.5 0.56 00:12:00
12 26.3 0.55 00:13:00
13 25.3 0.54 00:14:00
and I want to keep the max value of Concentr 1 in every 5-minute interval, along with the time it occurred and the value of Concentr 2 at that time. So, for the previous example, I would like to have:
Concentr 1 Concentr 2 Time
0 26.5 0.49 00:02:00
1 28.8 0.56 00:09:00
2 27.5 0.56 00:12:00
My current approach would be: i) create an auxiliary variable with an ID for each 5-minute interval (e.g. 00:00 to 00:05 is interval 1, 00:05 to 00:10 is interval 2, etc.), ii) use the interval variable in a groupby to get the max Concentr 1 per interval, and iii) merge back to the initial df using both the interval variable and Concentr 1, thus identifying the corresponding time.
I would like to ask if there is a better / more efficient / more elegant way to do it.
Thank you very much for any help.
You can do a regular resample / groupby, and use the idxmax method to get the desired row for each group. Then use that to index your original data:
>>> df.loc[df.resample('5T', on='Time')['Concentr 1'].idxmax()]
Concentr 1 Concentr 2 Time
1 26.5 0.49 2021-10-09 00:02:00
8 28.8 0.56 2021-10-09 00:09:00
11 27.5 0.56 2021-10-09 00:12:00
This assumes your 'Time' column is datetime-like, which I ensured with pd.to_datetime. You can convert the time column back with strftime. So in full:
df['Time'] = pd.to_datetime(df['Time'])
result = df.loc[df.resample('5T', on='Time')['Concentr 1'].idxmax()]
result['Time'] = result['Time'].dt.strftime('%H:%M:%S')
Giving:
Concentr 1 Concentr 2 Time
1 26.5 0.49 00:02:00
8 28.8 0.56 00:09:00
11 27.5 0.56 00:12:00
df = df.set_index('Time')
idx = df.resample('5T')['Concentr 1'].idxmax()
df = df.loc[idx]
Then you would probably need to reset_index() if you do not wish Time to be your index.
You can also use this:
Group every n=5 rows and filter the original df based on the index of the max of "Concentr 1" per group:
df = df[df.index.isin(df.groupby(df.index // 5)["Concentr 1"].idxmax())]
print(df)
Output:
Concentr 1 Concentr 2 Time
1 26.5 0.49 00:02:00
8 28.8 0.56 00:09:00
11 27.5 0.56 00:12:00
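The df.index // 5 trick above relies on the default RangeIndex starting at 0. A position-based sketch that works regardless of the index values could be:

```python
import numpy as np
import pandas as pd

# The question's data (times kept as plain strings)
df = pd.DataFrame({
    'Concentr 1': [25.4, 26.5, 25.2, 23.7, 23.8, 24.6, 26.3,
                   27.1, 28.8, 23.9, 25.6, 27.5, 26.3, 25.3],
    'Concentr 2': [0.48, 0.49, 0.52, 0.49, 0.55, 0.53, 0.57,
                   0.59, 0.56, 0.54, 0.49, 0.56, 0.55, 0.54],
    'Time': [f'00:{m:02d}:00' for m in range(1, 15)],
})

# Bin by row position, 5 rows per bin, independent of the index values
bins = np.arange(len(df)) // 5
out = df.loc[df.groupby(bins)['Concentr 1'].idxmax()]
```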

How to group by hourly for each day of the week?

I have a long table with a datetime column and a value column; a short example of the dataframe is below. What I currently do is group by hour, weekday, or month and take the mean, so I get the mean per month or per hour over all times.
This is for the hourly value: hourly_value = df.groupby([lambda idx: idx.hour]).agg([np.mean, np.std])
datetime value
0 2018-01-01 00:30:00+01:00 0.22
1 2018-01-01 00:35:00+01:00 0.31
2 2018-01-02 00:30:00+01:00 1.15
3 2018-01-02 00:35:00+01:00 1.80
4 2018-01-03 00:30:00+01:00 2.60
5 2018-01-03 00:35:00+01:00 2.30
6 2018-01-04 00:30:00+01:00 1.90
7 2018-01-04 00:35:00+01:00 2.10
8 2018-01-05 00:30:00+01:00 2.90
Now what I want is the hourly value for each day of the week: every hour on Mondays, every hour on Tuesdays, every hour on Wednesdays, and so on.
Can someone help me with this?:)
You can try (assuming the datetime is the second element of each index entry, e.g. after appending it to the index):
df.groupby(lambda idx: (idx[1].hour, idx[1].strftime("%A"))).agg([np.mean, np.std])
output:
value
mean std
(0, Friday) 2.900 NaN
(0, Monday) 0.265 0.063640
(0, Thursday) 2.000 0.141421
(0, Tuesday) 1.475 0.459619
(0, Wednesday) 2.450 0.212132
Here the index is an (hour, weekday) pair. But note that, e.g., Mondays from different weeks are grouped into one group.
Another way of calculating:
df.resample('1D', on='datetime').agg([np.mean, np.std])
outputs:
value
mean std
datetime
2017-12-31 0.265 0.063640
2018-01-01 1.475 0.459619
2018-01-02 2.450 0.212132
2018-01-03 2.000 0.141421
2018-01-04 2.900 NaN
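If the datetime values live in a regular column rather than in the index, a spelling without the idx[1] indirection (assuming the column names from the question, with its first four rows as toy data) could be:

```python
import pandas as pd

# Toy data matching the question's first four rows (an assumption)
df = pd.DataFrame({
    'datetime': pd.to_datetime([
        '2018-01-01 00:30:00+01:00', '2018-01-01 00:35:00+01:00',
        '2018-01-02 00:30:00+01:00', '2018-01-02 00:35:00+01:00',
    ]),
    'value': [0.22, 0.31, 1.15, 1.80],
})

# Group on hour and weekday name taken straight from the column
out = (df.groupby([df['datetime'].dt.hour.rename('hour'),
                   df['datetime'].dt.day_name().rename('weekday')])['value']
         .agg(['mean', 'std']))
```

As before, Mondays from different weeks fall into the same (hour, weekday) group.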

pandas asfreq() function is not showing the results

I am trying to find the gaps in a time series (30-minute interval) and fill them with NaN.
I am following the instructions in this post, but the code does not perform well in my case:
Find gaps in pandas time series dataframe sampled at 1 minute intervals and fill the gaps with new rows
import pandas as pd
from datetime import datetime
df=pd.read_csv('example1.csv')
df.head(5)
Date & Time KW
0 8/27/2019 23:30 0.016
1 8/27/2019 23:00 0
2 8/27/2019 22:30 0.016
3 8/27/2019 22:00 0.016
4 8/27/2019 21:30 0
df['Date & Time'] = pd.to_datetime(df['Date & Time'], format='%m/%d/%Y %H:%M')
df=df.set_index('Date & Time').asfreq('30M')
df.head()
KW
Date & Time
As you can see, the output is blank, while I expected to see values. I am not sure what I am doing wrong.
'M' means month(-end) frequency, not minutes, and your 'Date & Time' column is decreasing. You need to use 'T' (minutes) and a negative value.
Try on this sample
df:
Date & Time KW
0 8/27/2019 23:30 0.016
1 8/27/2019 23:00 0.000
2 8/27/2019 22:30 0.016
3 8/27/2019 22:00 0.016
4 8/27/2019 21:30 0.000
5 8/27/2019 19:30 0.000
df.set_index('Date & Time').asfreq('-30T')
Out[412]:
KW
Date & Time
2019-08-27 23:30:00 0.016
2019-08-27 23:00:00 0.000
2019-08-27 22:30:00 0.016
2019-08-27 22:00:00 0.016
2019-08-27 21:30:00 0.000
2019-08-27 21:00:00 NaN
2019-08-27 20:30:00 NaN
2019-08-27 20:00:00 NaN
2019-08-27 19:30:00 0.000
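Alternatively (a sketch using the same sample data), sort the index ascending first so the ordinary 30-minute alias works, which is the more common spelling:

```python
import pandas as pd

df = pd.DataFrame({
    'Date & Time': pd.to_datetime(['8/27/2019 23:30', '8/27/2019 23:00',
                                   '8/27/2019 22:30', '8/27/2019 22:00',
                                   '8/27/2019 21:30', '8/27/2019 19:30'],
                                  format='%m/%d/%Y %H:%M'),
    'KW': [0.016, 0.0, 0.016, 0.016, 0.0, 0.0],
})

# An ascending index lets the plain '30min' frequency fill gaps with NaN
out = (df.set_index('Date & Time')
         .sort_index()
         .asfreq('30min'))
```

Here the result runs from 19:30 to 23:30 (9 rows), with NaN at the three missing half-hours.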
