Resampling grouped data to obtain daily average data using Pandas - python

I'm new to pandas and I'm having some problems trying to obtain daily averages from a data file.
So, my data is structured as follows:
DATA ESTACION
DATETIME
2020-01-15 00:00:00 175 47
2020-01-15 01:00:00 152 47
2020-01-15 02:00:00 180 47
2020-01-15 03:00:00 132 47
2020-01-15 04:00:00 115 47
... ... ...
2020-03-13 19:00:00 38 16
2020-03-13 20:00:00 53 16
2020-03-13 21:00:00 73 16
2020-03-13 22:00:00 28 16
2020-03-13 23:00:00 22 16
These are air pollution results gathered by 24 stations. Each station receives hourly information as you can see.
I'm trying to get daily average data by station. So this is what I do:
I group all the data by station:
grouped = data.groupby(['ESTACION'])
Then I get the daily average by resampling the grouped data:
resampled = grouped.resample('D').mean()
And this is what I've obtained:
DATA ESTACION
ESTACION DATETIME
4 2020-01-02 18.250000 4.0
2020-01-03 NaN NaN
2020-01-04 NaN NaN
2020-01-05 NaN NaN
2020-01-06 NaN NaN
... ... ...
60 2020-11-29 NaN NaN
2020-11-30 NaN NaN
2020-12-01 NaN NaN
2020-12-02 118.666667 60.0
2020-12-03 80.833333 60.0
I don't really know what's going on, because I've only got data for 2020-01-15 to 2020-03-13, yet it shows me rows for other timestamps and NaN results.
If you need anything else to reproduce this case let me know.
Thanks and best regards

The output is expected, because resample always creates a consecutive DatetimeIndex. So you can remove the missing rows with DataFrame.dropna:
resampled = grouped.resample('D').mean().dropna()
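To see the effect in isolation, here is a small reproducible example (toy values, a single station with observations three days apart, so the two days in between come out as NaN):

import pandas as pd

# toy data: one station, observations on 2020-01-15 and 2020-01-18
idx = pd.to_datetime(['2020-01-15 00:00', '2020-01-15 01:00', '2020-01-18 00:00'])
data = pd.DataFrame({'DATA': [175, 152, 132], 'ESTACION': [47, 47, 47]}, index=idx)

# resample('D') emits every calendar day between the group's first and
# last observation, so 2020-01-16 and 2020-01-17 appear as NaN rows
print(data.groupby('ESTACION').resample('D').mean())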
Another solution is to group by the date part of the index, which avoids creating the empty days in the first place (if DATETIME were a regular column, Series.dt.date would do the same):
data.groupby(['ESTACION', data.index.date]).mean()

Related

How to fill NANs with a specific row of data

I am a new python user and have a few questions about filling NAs in a data frame.
Currently, I have a data frame with a series of dates from 2022-08-01 to 2037-08-01 at monthly frequency.
However, after 2027-06-01 the pricing data stops, and I would like to extrapolate the values forward to fill out the rest of the dates. Essentially, I would like to take the last 12 months of prices and fill those forward for the rest of the data frame. I am thinking of some type of groupby month with fillna(method='ffill'); however, when I do this it just fills the last value in the df forward. Below is an example of my code.
In my real data the values stop at 2023-12-01, and I wish to fill the previous 12 values forward for the rest of the maturity dates, so all prices from 2023-01-01 to 2023-12-01 would be filled forward for all later months.
import pandas as pd

mat = pd.DataFrame(pd.date_range('01/01/2020', '01/01/2022', freq='MS'))
prices = pd.DataFrame([179.06, 174.6, 182.3, 205.59, 204.78, 202.19,
                       216.17, 218.69, 220.73, 223.28, 225.16, 226.31])
example = pd.concat([mat, prices], axis=1)
example.columns = ['maturity', 'price']
Output
0 2020-01-01 179.06
1 2020-02-01 174.6
2 2020-03-01 182.3
3 2020-04-01 205.59
4 2020-05-01 204.78
5 2020-06-01 202.19
6 2020-07-01 216.17
7 2020-08-01 218.69
8 2020-09-01 220.73
9 2020-10-01 223.28
10 2020-11-01 225.16
11 2020-12-01 226.31
12 2021-01-01 NaN
13 2021-02-01 NaN
14 2021-03-01 NaN
15 2021-04-01 NaN
16 2021-05-01 NaN
17 2021-06-01 NaN
18 2021-07-01 NaN
19 2021-08-01 NaN
20 2021-09-01 NaN
21 2021-10-01 NaN
22 2021-11-01 NaN
23 2021-12-01 NaN
24 2022-01-01 NaN
Is this what you're looking for? Grouping by the calendar month puts all the January rows in one group, all the February rows in another, and so on, so a forward fill within each group carries each month's last observed price forward:
out = example.groupby(example.maturity.dt.month).ffill()
print(out)
Output:
maturity price
0 2020-01-01 179.06
1 2020-02-01 174.6
2 2020-03-01 182.3
3 2020-04-01 205.59
4 2020-05-01 204.78
5 2020-06-01 202.19
6 2020-07-01 216.17
7 2020-08-01 218.69
8 2020-09-01 220.73
9 2020-10-01 223.28
10 2020-11-01 225.16
11 2020-12-01 226.31
12 2021-01-01 179.06
13 2021-02-01 174.6
14 2021-03-01 182.3
15 2021-04-01 205.59
16 2021-05-01 204.78
17 2021-06-01 202.19
18 2021-07-01 216.17
19 2021-08-01 218.69
20 2021-09-01 220.73
21 2021-10-01 223.28
22 2021-11-01 225.16
23 2021-12-01 226.31
24 2022-01-01 179.06

Most effective method to get the max value from a column based on a timedelta calculated from the current row

I would like to identify the maximum value in a column that occurs within the following X days from the current date.
This is a subset of the data frame, showing the daily values for 2020.
Date Data
6780 2020-01-02 323.540009
6781 2020-01-03 321.160004
6782 2020-01-06 320.489990
6783 2020-01-07 323.019989
6784 2020-01-08 322.940002
... ... ...
7028 2020-12-24 368.079987
7029 2020-12-28 371.739990
7030 2020-12-29 373.809998
7031 2020-12-30 372.339996
I would like to find a way to identify the max value within the following 30 days. e.g.
Date Data Max
6780 2020-01-02 323.540009 323.019989
6781 2020-01-03 321.160004 323.019989
6782 2020-01-06 320.489990 323.730011
6783 2020-01-07 323.019989 323.540009
6784 2020-01-08 322.940002 325.779999
... ... ... ...
7028 2020-12-24 368.079987 373.809998
7029 2020-12-28 371.739990 373.809998
7030 2020-12-29 373.809998 372.339996
7031 2020-12-30 372.339996 373.100006
I tried calculating the start and end dates and storing them in the columns. e.g.
df['startDate'] = df['Date'] + pd.to_timedelta(1, unit='d')
df['endDate'] = df['Date'] + pd.to_timedelta(30, unit='d')
before trying to calculate the max, e.g.:
df['Max'] = df.loc[(df['Date'] > df['startDate']) & (df['Date'] < df['endDate'])]['Data'].max()
But this results in;
Date Data startDate endDate Max
6780 2020-01-02 323.540009 2020-01-03 2020-01-29 NaN
6781 2020-01-03 321.160004 2020-01-04 2020-01-30 NaN
6782 2020-01-06 320.489990 2020-01-07 2020-02-02 NaN
6783 2020-01-07 323.019989 2020-01-08 2020-02-03 NaN
6784 2020-01-08 322.940002 2020-01-09 2020-02-04 NaN
... ... ... ... ... ...
7027 2020-12-23 368.279999 2020-12-24 2021-01-19 NaN
7028 2020-12-24 368.079987 2020-12-25 2021-01-20 NaN
7029 2020-12-28 371.739990 2020-12-29 2021-01-24 NaN
7030 2020-12-29 373.809998 2020-12-31 2021-01-26 NaN
If I statically add dates to the loc[] statement, it partially works, filling the max for that static range; however, this just gives me the same value for every row.
Any help on the correct panda way to achieve this would be appreciated.
Kind Regards
df.rolling can do this if you make the dates a DatetimeIndex:
df["Date"] = pd.to_datetime(df.Date)
df.set_index("Date").rolling("2d").max()
output:
Data
Date
2020-01-02 323.540009
2020-01-03 323.540009
2020-01-06 320.489990
2020-01-07 323.019989
2020-01-08 323.019989
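Note that a time-based rolling window in pandas ends at the current row, i.e. it looks backward, so "2d" above gives the max over the current and previous day. To get the max over the following days, as asked, one simple (if not the fastest) sketch is to slice the series per row:

import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
s = df.set_index('Date')['Data'].sort_index()

# for each row, take the max over (Date, Date + 30 days];
# .loc slicing on a sorted DatetimeIndex includes both endpoints,
# so the window starts one day after the current date
df['Max'] = [s.loc[d + pd.Timedelta(days=1):d + pd.Timedelta(days=30)].max()
             for d in df['Date']]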

Incomplete filling when upsampling with `agg` for multiple columns (pandas resample)

I found this behavior of resample to be confusing after working on a related question. Here are some time series data at 5 minute intervals but with missing rows (code to construct at end):
user value total
2020-01-01 09:00:00 fred 1 1
2020-01-01 09:05:00 fred 13 1
2020-01-01 09:15:00 fred 27 3
2020-01-01 09:30:00 fred 40 12
2020-01-01 09:35:00 fred 15 12
2020-01-01 10:00:00 fred 19 16
I want to fill in the missing times, using a different method for each column. For user and total I want to do a forward fill, while for value I want to fill in with zeroes.
One approach I found was to resample, and then fill in the missing data after the fact:
resampled = df.resample('5T').asfreq()
resampled['user'] = resampled['user'].ffill()
resampled['total'] = resampled['total'].ffill()
resampled['value'] = resampled['value'].fillna(0)
Which gives correct expected output:
user value total
2020-01-01 09:00:00 fred 1.0 1.0
2020-01-01 09:05:00 fred 13.0 1.0
2020-01-01 09:10:00 fred 0.0 1.0
2020-01-01 09:15:00 fred 27.0 3.0
2020-01-01 09:20:00 fred 0.0 3.0
2020-01-01 09:25:00 fred 0.0 3.0
2020-01-01 09:30:00 fred 40.0 12.0
2020-01-01 09:35:00 fred 15.0 12.0
2020-01-01 09:40:00 fred 0.0 12.0
2020-01-01 09:45:00 fred 0.0 12.0
2020-01-01 09:50:00 fred 0.0 12.0
2020-01-01 09:55:00 fred 0.0 12.0
2020-01-01 10:00:00 fred 19.0 16.0
I thought one would be able to use agg to specify what to do for each column, so I tried the following:
resampled = df.resample('5T').agg({'user': 'ffill',
                                   'value': 'sum',
                                   'total': 'ffill'})
I find this clearer and simpler, but it doesn't give the expected output. The sum works, but the forward fill does not:
user value total
2020-01-01 09:00:00 fred 1 1.0
2020-01-01 09:05:00 fred 13 1.0
2020-01-01 09:10:00 NaN 0 NaN
2020-01-01 09:15:00 fred 27 3.0
2020-01-01 09:20:00 NaN 0 NaN
2020-01-01 09:25:00 NaN 0 NaN
2020-01-01 09:30:00 fred 40 12.0
2020-01-01 09:35:00 fred 15 12.0
2020-01-01 09:40:00 NaN 0 NaN
2020-01-01 09:45:00 NaN 0 NaN
2020-01-01 09:50:00 NaN 0 NaN
2020-01-01 09:55:00 NaN 0 NaN
2020-01-01 10:00:00 fred 19 16.0
Can someone explain this output, and whether there is a way to achieve the expected output using agg? It seems odd that the forward fill doesn't work here, since resampled = df.resample('5T').ffill() on its own would forward-fill every column (undesired here, as it would also fill the value column). The closest I have come is to resample each column individually and apply the function I want:
resampled = pd.DataFrame()
d = {'user': 'ffill',
     'value': 'sum',
     'total': 'ffill'}
for k, v in d.items():
    resampled[k] = df[k].resample('5T').apply(v)
This works, but feels silly given that it adds extra iteration and uses the very dictionary I am trying to pass to agg! I have looked at a few posts on agg and apply but can't seem to explain what is happening here:
Losing String column when using resample and aggregation with pandas
resample multiple columns with pandas
pandas groupby with agg not working on multiple columns
Pandas named aggregation not working with resample agg
I have also tried using groupby with a pd.Grouper and using the pd.NamedAgg class, with no luck.
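For completeness, the per-column loop above can also be written without explicit iteration, collecting the resampled columns with concat (a sketch, same behaviour as the loop):

import pandas as pd

d = {'user': 'ffill', 'value': 'sum', 'total': 'ffill'}
# resample each column with its own function, then stitch them back together
resampled = pd.concat({k: df[k].resample('5T').apply(v) for k, v in d.items()},
                      axis=1)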
Example data:
import pandas as pd
dates = ['01-01-2020 9:00', '01-01-2020 9:05', '01-01-2020 9:15',
         '01-01-2020 9:30', '01-01-2020 9:35', '01-01-2020 10:00']
dates = pd.to_datetime(dates)
df = pd.DataFrame({'user': ['fred'] * len(dates),
                   'value': [1, 13, 27, 40, 15, 19],
                   'total': [1, 1, 3, 12, 12, 16]},
                  index=dates)

Python: how to linearly interpolate monthly data?

I'm fairly new to python, especially the data libraries, so please excuse any idiocy.
I'm trying to practise with a made up data set of monthly observations over 12 months, data looks like this...
print(data)
2017-04-17 156
2017-05-09 216
2017-06-11 300
2017-07-29 184
2017-08-31 162
2017-09-24 91
2017-10-15 225
2017-11-03 245
2017-12-26 492
2018-01-26 485
2018-02-18 401
2018-03-09 215
2018-04-30 258
These monthly observations are irregular (there is exactly one in each month but nowhere near the same time).
Now, I want to use linear interpolation to get the values at the start of each month.
I've tried a bunch of methods and was able to do it 'manually', but I'm trying to get to grips with pandas and numpy, and I know it can be done with them. Here's what I have so far: I make a Series holding the data, and then I do:
resampled1 = data.resample('MS')
interp1 = resampled1.interpolate()
print(interp1)
This prints:
2017-04-01 NaN
2017-05-01 NaN
2017-06-01 NaN
2017-07-01 NaN
2017-08-01 NaN
2017-09-01 NaN
2017-10-01 NaN
2017-11-01 NaN
2017-12-01 NaN
2018-01-01 NaN
2018-02-01 NaN
2018-03-01 NaN
2018-04-01 NaN
Now, I know that the first one (2017-04-01) should be NaN, since linear interpolation (which I believe is the default) interpolates between the two points before and after, and there is no data point before April 1st. As for the others, I'm not certain what I'm doing wrong; probably I'm just struggling to wrap my head around exactly what resample is doing.
You probably want to resample('D') before interpolating; on the evenly spaced daily grid, the default linear interpolation coincides with interpolation in time, e.g.:
In []:
data.resample('D').interpolate().asfreq('MS')
Out[]:
2017-05-01 194.181818
2017-06-01 274.545455
2017-07-01 251.666667
2017-08-01 182.000000
2017-09-01 159.041667
2017-10-01 135.666667
2017-11-01 242.894737
2017-12-01 375.490566
2018-01-01 490.645161
2018-02-01 463.086957
2018-03-01 293.315789
2018-04-01 234.019231
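The same numbers can also be had without the daily upsample by interpolating directly against the timestamps; method='time' weights by the actual gaps between observations. A sketch, assuming data is the Series above:

import pandas as pd

# union the original index with the desired month starts, interpolate
# in time, then keep only the month starts
month_starts = pd.date_range('2017-05-01', '2018-04-01', freq='MS')
interp = (data.reindex(data.index.union(month_starts))
              .interpolate(method='time')
              .reindex(month_starts))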
You can also try RedBlackPy:
from datetime import datetime
import redblackpy as rb

index = [datetime(2017, 4, 17), datetime(2017, 5, 9), datetime(2017, 6, 11)]
values = [156, 216, 300]
series = rb.Series(index=index, values=values, interpolate='linear')

# Now you can access by any key with no insertion, using interpolation.
print(series[datetime(2017, 5, 1)])  # prints 194.18182373046875

Group data by time of the day

I have a dataframe with a datetime index: df.head(6)
NUMBERS PRICE
DEAL_TIME
2015-03-02 12:40:03 5 25
2015-03-04 14:52:57 7 23
2015-03-03 08:10:09 10 43
2015-03-02 20:18:24 5 37
2015-03-05 07:50:55 4 61
2015-03-02 09:08:17 1 17
The dataframe covers one week of data. Now I need to count deals by time period within the day. If the period is 1 hour, I know the following method works:
df_grouped = df.groupby(df.index.hour).count()
But I don't know what to do when the period is half an hour. How can I achieve this?
UPDATE:
I was told that this question is similar to How to group DataFrame by a period of time?
But I have tried the methods mentioned there. Maybe it's my fault for not stating it clearly: 'DEAL_TIME' ranges from '2015-03-02 00:00:00' to '2015-03-08 23:59:59'. If I use pd.TimeGrouper(freq='30Min') or resample(), the resulting periods range from '2015-03-02 00:30' to '2015-03-08 23:30'. But what I want is a series like the one below:
COUNT
DEAL_TIME
00:00:00 53
00:30:00 49
01:00:00 31
01:30:00 22
02:00:00 1
02:30:00 24
03:00:00 27
03:30:00 41
04:00:00 41
04:30:00 76
05:00:00 33
05:30:00 16
06:00:00 15
06:30:00 4
07:00:00 60
07:30:00 85
08:00:00 3
08:30:00 37
09:00:00 18
09:30:00 29
10:00:00 31
10:30:00 67
11:00:00 35
11:30:00 60
12:00:00 95
12:30:00 37
13:00:00 30
13:30:00 62
14:00:00 58
14:30:00 44
15:00:00 45
15:30:00 35
16:00:00 94
16:30:00 56
17:00:00 64
17:30:00 43
18:00:00 60
18:30:00 52
19:00:00 14
19:30:00 9
20:00:00 31
20:30:00 71
21:00:00 21
21:30:00 32
22:00:00 61
22:30:00 35
23:00:00 14
23:30:00 21
In other words, the time period should be independent of the date.
You need a 30-minute grouper for this:
grouper = pd.Grouper(freq="30T")
You also need to remove the 'date' part from the index, leaving only the time of day as a timedelta:
df.index = df.index - df.index.normalize()
Now, you can group by time alone:
df.groupby(grouper).count()
(pd.TimeGrouper, used in older examples, was deprecated and later removed in favour of pd.Grouper.) You can find the somewhat obscure Grouper documentation here: pandas resample documentation (it's actually the resample documentation, but both features use the same rules).
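Alternatively, the same grouping can be done without a Grouper, by flooring the time-of-day directly (a minimal sketch, assuming DEAL_TIME is the DatetimeIndex):

import pandas as pd

# reduce each timestamp to its time of day, then count per 30-minute slot
tod = df.index - df.index.normalize()          # TimedeltaIndex of times-of-day
counts = df.groupby(tod.floor('30T')).size()   # deals per half hour, dates ignored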
In pandas, the most common way to group by time is to use the .resample() function. Since v0.18.0 this function is two-stage: df.resample('M') creates an object to which we can apply other functions (mean, count, sum, etc.). The code snippet will look like:
df.resample('M').count()
You can refer here for an example.
