Averaging values of dates in dataframe - python

I have the following dataframe:
data = pd.DataFrame({
    'date': ['1988/01/12', '1988/01/13', '1988/01/14', '1989/01/20', '1990/01/01'],
    'value': [11558522, 12323552, 13770958, 18412280, 13770958]
})
Is there a way in python to average the values over a whole month and make that the new value for that month? I.e. I want to average the 1988-01 values and make that the final value for 1988-01. I tried the groupby method, but that didn't work:
new_df = data.groupby(['date']).mean()

Use month periods created by Series.dt.to_period:
data['date'] = pd.to_datetime(data['date'])
new_df = data.groupby(data['date'].dt.to_period('m')).mean()
print(new_df)
                value
date
1988-01  1.255101e+07
1989-01  1.841228e+07
1990-01  1.377096e+07
Or use DataFrame.resample and if necessary remove missing values:
new_df = data.resample('MS', on='date').mean().dropna()
print(new_df)
                   value
date
1988-01-01  1.255101e+07
1989-01-01  1.841228e+07
1990-01-01  1.377096e+07
Or you can use months and years separately for MultiIndex:
new_df = data.groupby([data['date'].dt.year.rename('y'),
                       data['date'].dt.month.rename('m')]).mean()
print(new_df)
               value
y    m
1988 1  1.255101e+07
1989 1  1.841228e+07
1990 1  1.377096e+07
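If instead the goal is to overwrite each row's value with its month's average (keeping one row per date rather than one row per month), GroupBy.transform can broadcast the mean back onto the original rows. A minimal sketch, assuming data['date'] is already datetime:
# Replace every value with its month's mean, preserving the row count (sketch)
data['value'] = data.groupby(data['date'].dt.to_period('m'))['value'].transform('mean')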

The same pattern works on a real dataset: parse the datetime column on read, then group with pd.Grouper. For example, with hourly temperature readings:
df = pd.read_csv("data.csv", encoding='ISO-8859-1', parse_dates=["datetime"])
print(df)
print(df.dtypes)
datetime Temperature
0 1987-11-01 07:00:00 21.4
1 1987-11-01 13:00:00 27.4
2 1987-11-01 19:00:00 25.0
3 1987-11-02 07:00:00 22.0
4 1987-11-02 13:00:00 27.6
... ...
27554 2020-03-30 13:00:00 24.8
27555 2020-03-30 18:00:00 23.8
27556 2020-03-31 07:00:00 23.4
27557 2020-03-31 13:00:00 24.6
27558 2020-03-31 18:00:00 26.4
df1 = df.groupby(pd.Grouper(key='datetime', freq='D')).mean()
            Temperature
datetime
1987-11-01 24.600000
1987-11-02 25.066667
1987-11-03 24.466667
1987-11-04 22.533333
1987-11-05 25.066667
...
2020-03-27 26.533333
2020-03-28 27.666667
2020-03-29 27.733333
2020-03-30 24.266667
2020-03-31 24.800000
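The same pd.Grouper pattern scales to other frequencies; for monthly means, a sketch using the same column names:
# Monthly mean temperature instead of daily (sketch)
df2 = df.groupby(pd.Grouper(key='datetime', freq='M')).mean()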


Transform an hourly dataframe into a monthly totalled dataframe in Python

I have a Pandas dataframe containing hourly precipitation data (tp) between 2013 and 2020; the dataframe is called df:
tp
time
2013-01-01 00:00:00 0.1
2013-01-01 01:00:00 0.1
2013-01-01 02:00:00 0.1
2013-01-01 03:00:00 0.0
2013-01-01 04:00:00 0.2
...
2020-12-31 19:00:00 0.2
2020-12-31 20:00:00 0.1
2020-12-31 21:00:00 0.0
2020-12-31 22:00:00 0.1
2020-12-31 23:00:00 0.0
I'm trying to convert this hourly dataset into monthly totals for each year. I then want to take the average of the monthly summed rainfall, so that I end up with a dataframe of 12 rows, one per month, showing the average summed rainfall over the whole period.
I've tried the resample function:
df.resample('M').mean()
However, this outputs the following and is not what I'm looking to achieve:
tp1
time
2013-01-31 0.121634
2013-02-28 0.318097
2013-03-31 0.356973
2013-04-30 0.518160
2013-05-31 0.055290
...
2020-09-30 0.132713
2020-10-31 0.070817
2020-11-30 0.060525
2020-12-31 0.040002
2021-01-31 0.000000
[97 rows x 1 columns]
While it's converting the hourly data to monthly, I want to show an average of the rainfall across the years.
e.g.
January Column = Average of January rainfall between 2013 and 2020.
Assuming your index is a DatetimeIndex, you can use:
out = df.groupby(df.index.month).mean()
print(out)
# Output
tp1
time
1 0.498262
2 0.502057
3 0.502644
4 0.496880
5 0.499100
6 0.497931
7 0.504981
8 0.497841
9 0.499646
10 0.499804
11 0.506938
12 0.501172
Setup:
import pandas as pd
import numpy as np
np.random.seed(2022)
dti = pd.date_range('2013-01-31', '2021-01-31', freq='H', name='time')
df = pd.DataFrame({'tp1': np.random.random(len(dti))}, index=dti)
print(df)
# Output
tp1
time
2013-01-31 00:00:00 0.009359
2013-01-31 01:00:00 0.499058
2013-01-31 02:00:00 0.113384
2013-01-31 03:00:00 0.049974
2013-01-31 04:00:00 0.685408
... ...
2021-01-30 20:00:00 0.021295
2021-01-30 21:00:00 0.275759
2021-01-30 22:00:00 0.367263
2021-01-30 23:00:00 0.777680
2021-01-31 00:00:00 0.021225
[70129 rows x 1 columns]
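Note that df.groupby(df.index.month).mean() averages the hourly values within each calendar month. If you want the average of the monthly totals instead (sum the rainfall per month, then average those sums across years), a sketch with the same DataFrame:
# Sum per calendar month of each year, then average those totals across years (sketch)
monthly_totals = df.resample('M').sum()
avg_by_month = monthly_totals.groupby(monthly_totals.index.month).mean()  # 12 rows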

Change value if it repeats a certain number of times in a month

I have a dataframe with time data in the format:
date values
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 -9999
3 2013-01-01 03:00:00 -9999
4 2013-01-01 04:00:00 0.0
.. ... ...
8754 2016-12-31 18:00:00 427.5
8755 2016-12-31 19:00:00 194.9
8756 2016-12-31 20:00:00 -9999
8757 2016-12-31 21:00:00 237.6
8758 2016-12-31 22:00:00 -9999
8759 2016-12-31 23:00:00 0.0
For every month in which the value -9999 is repeated more than 175 times, I want those values changed to NaN.
Imagine that we have this other dataframe with the number of times the value is repeated per month:
date values
0 2013-01 200
1 2013-02 0
2 2013-03 2
3 2013-04 181
4 2013-05 0
5 2013-06 0
6 2013-07 66
7 2013-08 0
8 2013-09 7
In this case, the months of January and April exceeded the stipulated value, so that first dataframe should become:
date values
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 NaN
3 2013-01-01 03:00:00 NaN
4 2013-01-01 04:00:00 0.0
.. ... ...
8754 2016-12-31 18:00:00 427.5
8755 2016-12-31 19:00:00 194.9
8756 2016-12-31 20:00:00 -9999
8757 2016-12-31 21:00:00 237.6
8758 2016-12-31 22:00:00 -9999
8759 2016-12-31 23:00:00 0.0
My idea was to use tolist() to build a list of the months in which the value appears more than 175 times, and then apply a condition like df["values"] == -9999 and df["date"] in list_with_months to change the values.
You can do this using a transform call where you calculate the number of values per month in the same dataframe. Then you create a new column conditionally on this:
import numpy as np

MISSING = -9999
THRESHOLD = 175

# Create a month column
df['month'] = df['date'].dt.to_period('M')

# Count number of MISSING per month and assign to dataframe
df['n_missing'] = (
    df.groupby('month')['values']
      .transform(lambda d: (d == MISSING).sum())
)

# If value is MISSING and the monthly count is above THRESHOLD,
# replace with NaN; otherwise keep the original value
df['new_value'] = np.where(
    (df['values'] == MISSING) & (df['n_missing'] > THRESHOLD),
    np.nan,
    df['values']
)
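The asker's own idea (collect the offending months, then mask with isin) also works; a sketch, assuming df['date'] is already datetime:
# Build per-month missing counts, pick months over the limit, then mask (sketch)
per_month = df['date'].dt.to_period('M')
counts = df['values'].eq(MISSING).groupby(per_month).sum()
bad_months = counts[counts > THRESHOLD].index
df.loc[df['values'].eq(MISSING) & per_month.isin(bad_months), 'values'] = np.nan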

How to use pandas Grouper to get sum of values within each hour

I have the following table:
Hora_Retiro count_uses
0 00:00:18 1
1 00:00:34 1
2 00:02:27 1
3 00:03:13 1
4 00:06:45 1
... ... ...
748700 23:58:47 1
748701 23:58:49 1
748702 23:59:11 1
748703 23:59:47 1
748704 23:59:56 1
And I want to group all values within each hour, so I can see the total number of uses per hour (00:00:00 - 23:00:00)
I have the following code:
hora_pico_aug = hora_pico.groupby(pd.Grouper(key="Hora_Retiro", freq='H')).count()
The Hora_Retiro column is of type timedelta64[ns].
Which gives the following output:
count_uses
Hora_Retiro
00:00:02 2566
01:00:02 602
02:00:02 295
03:00:02 5
04:00:02 10
05:00:02 4002
06:00:02 16075
07:00:02 39410
08:00:02 76272
09:00:02 56721
10:00:02 36036
11:00:02 32011
12:00:02 33725
13:00:02 41032
14:00:02 50747
15:00:02 50338
16:00:02 42347
17:00:02 54674
18:00:02 76056
19:00:02 57958
20:00:02 34286
21:00:02 22509
22:00:02 13894
23:00:02 7134
However, the index column starts at 00:00:02, and I want it to start at 00:00:00 and then go in one-hour intervals. Something like this:
count_uses
Hora_Retiro
00:00:00 2565
01:00:00 603
02:00:00 295
03:00:00 5
04:00:00 10
05:00:00 4002
06:00:00 16075
07:00:00 39410
08:00:00 76272
09:00:00 56721
10:00:00 36036
11:00:00 32011
12:00:00 33725
13:00:00 41032
14:00:00 50747
15:00:00 50338
16:00:00 42347
17:00:00 54674
18:00:00 76056
19:00:00 57958
20:00:00 34286
21:00:00 22509
22:00:00 13894
23:00:00 7134
How can I make it start at 00:00:00? Thanks for the help!
You can create an hour column from the Hora_Retiro column. Since the column is timedelta64, take the hours component (for a datetime column you would use .dt.hour instead):
df['hour'] = df['Hora_Retiro'].dt.components.hours
And then group on the basis of the hour:
gpby_df = df.groupby('hour')['count_uses'].sum().reset_index()
gpby_df['hour'] = pd.to_datetime(gpby_df['hour'], format='%H').dt.time
gpby_df.columns = ['Hora_Retiro', 'sum_count_uses']
gpby_df
gives
Hora_Retiro sum_count_uses
0 00:00:00 14
1 09:00:00 1
2 10:00:00 2
3 20:00:00 2
I assume that the Hora_Retiro column in your DataFrame is of Timedelta type. It is not datetime, as in that case the date part would be printed as well.
Indeed, your code creates groups starting at the minute / second taken from the first row.
To group by "full hours", floor each element in this column to the hour (rounding would push times past the half hour into the next bin), then group by this floored value:
hora_pico.groupby(hora_pico.Hora_Retiro.dt.floor('H')).count_uses.count()
However, I advise you to decide what you want to count: rows, or the values in the count_uses column. In the second case, replace count with sum.
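To guarantee that all 24 hourly bins appear even when a bin has no rows, you can reindex against a full timedelta range; a sketch, again assuming a timedelta64 column:
# Sum per floored hour, then fill in any empty hourly bins with 0 (sketch)
hourly = hora_pico.groupby(hora_pico['Hora_Retiro'].dt.floor('H'))['count_uses'].sum()
all_hours = pd.timedelta_range('0 hours', '23 hours', freq='H')
hourly = hourly.reindex(all_hours, fill_value=0)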

Adding missing time stamp rows to a df in pandas

I have very unusual time series data which is both irregular and has several missing values.
The data points are measured 3 times a day only on weekdays, at 10:00AM, 2:00PM, and 6:00PM, most days are missing one or two measurements, and some days are missing altogether.
My df looks something like this:
date time | value
0 2020-07-30 10:00:00 5
1 2020-07-30 14:00:00 3
2 2020-07-31 10:00:00 6
3 2020-07-31 14:00:00 4.5
4 2020-07-31 18:00:00 7
5 2020-08-03 14:00:00 5.5
6 2020-08-04 14:00:00 5
I'm trying to figure out how to fill it in with the missing measurements: add a row with the missing time stamp and an NA value, but without adding extra times of day or any Saturdays or Sundays, so that my df ends up looking like this:
date time | value
0 2020-07-30 10:00:00 5
1 2020-07-30 14:00:00 3
2 2020-07-30 18:00:00 NA
3 2020-07-31 10:00:00 6
4 2020-07-31 14:00:00 4.5
5 2020-07-31 18:00:00 7
6 2020-08-03 10:00:00 NA
7 2020-08-03 14:00:00 5.5
8 2020-08-03 18:00:00 NA
9 2020-08-04 10:00:00 NA
10 2020-08-04 14:00:00 5
11 2020-08-04 18:00:00 NA
The only thing I could come up with was pretty convoluted: write a loop to generate a row for all the dates in the desired date range * 3 (one for each measurement time) formatted as datetimes, along with an additional day-of-week counter. Convert it into a df, drop all rows where the day of week is 6 or 7, then join this new df with my original df on the datetime column (outer or left, whichever keeps all rows).
Is there any more elegant way of doing this?
You could create a filtered date range and reindex by it:
all_ts = pd.date_range(start=df['datetime'].min(), end=df['datetime'].max(), freq='H')
weekday_ts = all_ts[~all_ts.weekday.isin([5,6])]
filtered_ts = weekday_ts[weekday_ts.hour.isin([10, 14, 18])]
df.set_index(df['datetime']).reindex(filtered_ts).drop('datetime', axis=1).reset_index()
Alternatively, you can build the index explicitly from a range of dates and a range of times:
import datetime
import pandas as pd

df = pd.DataFrame([
    {"date time": datetime.datetime.strptime("2020-07-30 10:00:00", '%Y-%m-%d %H:%M:%S'), "value": 5},
    {"date time": datetime.datetime.strptime("2020-07-30 14:00:00", '%Y-%m-%d %H:%M:%S'), "value": 3},
    {"date time": datetime.datetime.strptime("2020-07-31 10:00:00", '%Y-%m-%d %H:%M:%S'), "value": 6},
    {"date time": datetime.datetime.strptime("2020-07-31 14:00:00", '%Y-%m-%d %H:%M:%S'), "value": 4.5},
    {"date time": datetime.datetime.strptime("2020-07-31 18:00:00", '%Y-%m-%d %H:%M:%S'), "value": 7},
    {"date time": datetime.datetime.strptime("2020-08-03 14:00:00", '%Y-%m-%d %H:%M:%S'), "value": 5.5},
    {"date time": datetime.datetime.strptime("2020-08-04 14:00:00", '%Y-%m-%d %H:%M:%S'), "value": 5},
])

# define the range of dates you're working with
range_dates = pd.date_range('2020-07-30', '2020-08-04', freq='D')
# remove weekend days
range_dates = range_dates[~range_dates.weekday.isin([5, 6])]
range_dates = pd.Series(range_dates)

# create a range with the 3 measurement times
range_times = pd.date_range('10:00:00', '18:00:00', freq='4H')
range_times = pd.Series(range_times.time)

# combine the two ranges into one set of timestamps
index = range_dates.apply(
    lambda date: range_times.apply(
        lambda time: datetime.datetime.combine(date, time)
    )
).unstack()

# index the dataframe by its datetime column, reindex against the full grid, and sort
df = df.set_index('date time').reindex(index).sort_index()
Output:
                     value
2020-07-30 10:00:00    5.0
2020-07-30 14:00:00    3.0
2020-07-30 18:00:00    NaN
2020-07-31 10:00:00    6.0
2020-07-31 14:00:00    4.5
2020-07-31 18:00:00    7.0
2020-08-03 10:00:00    NaN
2020-08-03 14:00:00    5.5
2020-08-03 18:00:00    NaN
2020-08-04 10:00:00    NaN
2020-08-04 14:00:00    5.0
2020-08-04 18:00:00    NaN
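For reference, the same weekday-and-times grid can be built without nested apply calls, using bdate_range (business days); a sketch using the same date bounds:
# Cartesian product of business days and the three measurement times (sketch)
days = pd.bdate_range('2020-07-30', '2020-08-04')
times = [datetime.time(h) for h in (10, 14, 18)]
index = pd.DatetimeIndex([datetime.datetime.combine(d, t) for d in days for t in times])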

How to calculate daily averages from noon to noon with pandas?

I am fairly new to python and pandas, so I apologise for any future misunderstandings.
I have a pandas DataFrame with hourly values, looking something like this:
2014-04-01 09:00:00 52.9 41.1 36.3
2014-04-01 10:00:00 56.4 41.6 70.8
2014-04-01 11:00:00 53.3 41.2 49.6
2014-04-01 12:00:00 50.4 39.5 36.6
2014-04-01 13:00:00 51.1 39.2 33.3
2016-11-30 16:00:00 16.0 13.5 36.6
2016-11-30 17:00:00 19.6 17.4 44.3
Now I need to calculate 24h average values for each column, starting from 2014-04-01 12:00 up to 2014-04-02 11:00.
So I want daily averages from noon to noon.
Unfortunately, I have no idea how to do that. I have read some suggestions to use groupby, but I don't really know how...
Thank you very much in advance! Any help is appreciated!!
For newer versions of pandas (>= 1.1.0), use the offset argument:
df.resample('24H', offset='12H').mean()
For older versions, use the base argument. A day is 24 hours, so base=12 starts the grouping at noon, giving Noon - Noon bins. Resample gives you all days in between, so you can .dropna(how='all') if you don't need the complete basis. (I assume you have a DatetimeIndex; if not, you can use the on argument of resample to specify your datetime column.)
df.resample('24H', base=12).mean()
#df.groupby(pd.Grouper(level=0, base=12, freq='24H')).mean() # Equivalent
1 2 3
0
2014-03-31 12:00:00 54.20 41.30 52.233333
2014-04-01 12:00:00 50.75 39.35 34.950000
2014-04-02 12:00:00 NaN NaN NaN
2014-04-03 12:00:00 NaN NaN NaN
2014-04-04 12:00:00 NaN NaN NaN
... ... ... ...
2016-11-26 12:00:00 NaN NaN NaN
2016-11-27 12:00:00 NaN NaN NaN
2016-11-28 12:00:00 NaN NaN NaN
2016-11-29 12:00:00 NaN NaN NaN
2016-11-30 12:00:00 17.80 15.45 40.450000
You could also subtract 12 hours and group by the normalized (midnight) timestamps:
df.groupby((df.index - pd.to_timedelta('12:00:00')).normalize()).mean()
You can shift the timestamps back by 12 hours and resample at day level.
from io import StringIO
import pandas as pd
data = """
2014-04-01 09:00:00,52.9,41.1,36.3
2014-04-01 10:00:00,56.4,41.6,70.8
2014-04-01 11:00:00,53.3,41.2,49.6
2014-04-01 12:00:00,50.4,39.5,36.6
2014-04-01 13:00:00,51.1,39.2,33.3
2016-11-30 16:00:00,16.0,13.5,36.6
2016-11-30 17:00:00,19.6,17.4,44.3
"""
df = pd.read_csv(StringIO(data), sep=',', header=None, index_col=0)
df.index = pd.to_datetime(df.index)
# shift by 12 hours
df.index = df.index - pd.Timedelta(hours=12)
# resample and drop na rows
df.resample('D').mean().dropna()
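Note that after the shift, each label is the midnight 12 hours before the actual window start; if you want each row stamped with the noon that begins its window, add the offset back (a sketch):
# Re-label each daily bin with its noon start (sketch)
out = df.resample('D').mean().dropna()
out.index = out.index + pd.Timedelta(hours=12)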
