Compute daily climatology using pandas python - python

I am trying to use pandas to compute daily climatology. My code is:
import pandas as pd
dates = pd.date_range('1950-01-01', '1953-12-31', freq='D')
rand_data = [int(1000*random.random()) for i in xrange(len(dates))]
cum_data = pd.Series(rand_data, index=dates)
cum_data.to_csv('test.csv', sep="\t")
cum_data is the data frame containing daily dates from 1st Jan 1950 to 31st Dec 1953. I want to create a new vector of length 365 with the first element containing the average of rand_data for January 1st for 1950, 1951, 1952 and 1953. And so on for the second element...
Any suggestions how I can do this using pandas?

You can groupby the day of the year, and the calculate the mean for these groups:
cum_data.groupby(cum_data.index.dayofyear).mean()
However, you have the be aware of leap years. This will cause problems with this approach. As alternative, you can also group by the month and the day:
In [13]: cum_data.groupby([cum_data.index.month, cum_data.index.day]).mean()
Out[13]:
1 1 462.25
2 631.00
3 615.50
4 496.00
...
12 28 378.25
29 427.75
30 528.50
31 678.50
Length: 366, dtype: float64

Hoping it can be of any help, I want to post my solution to get a climatology series with the same index and length of the original time series.
I use joris' solution to get a "model climatology" of 365/366 elements, then I build my desired series taking values from this model climatology and time index from my original time series.
This way, things like leap years are automatically taken care of.
#I start with my time series named 'serData'.
#I apply joris' solution to it, getting a 'model climatology' of length 365 or 366.
serClimModel = serData.groupby([serData.index.month, serData.index.day]).mean()
#Now I build the climatology series, taking values from serClimModel depending on the index of serData.
serClimatology = serClimModel[zip(serData.index.month, serData.index.day)]
#Now serClimatology has a time index like this: [1,1] ... [12,31].
#So, as a final step, I take as time index the one of serData.
serClimatology.index = serData.index

#joris. Thanks. Your answer was just what I needed to use pandas to calculate daily climatologies, but you stopped short of the final step. Re-mapping the month,day index back to an index of day of the year for all years, including leap years, i.e. 1 thru 366. So I thought I'd share my solution for other users. 1950 thru 1953 is 4 years with one leap year, 1952. Note since random values are used each run will give different results.
...
from datetime import date
doy = []
doy_mean = []
doy_size = []
for name, group in cum_data.groupby([cum_data.index.month, cum_data.index.day]):
(mo, dy) = name
# Note: can use any leap year here.
yrday = (date(1952, mo, dy)).timetuple().tm_yday
doy.append(yrday)
doy_mean.append(group.mean())
doy_size.append(group.count())
# Note: useful climatology stats are also available via group.describe() returned as dict
#desc = group.describe()
# desc["mean"], desc["min"], desc["max"], std,quartiles, etc.
# we lose the counts here.
new_cum_data = pd.Series(doy_mean, index=doy)
print new_cum_data.ix[366]
>> 634.5
pd_dict = {}
pd_dict["mean"] = doy_mean
pd_dict["size"] = doy_size
cum_data_df = pd.DataFrame(data=pd_dict, index=doy)
print cum_data_df.ix[366]
>> mean 634.5
>> size 4.0
>> Name: 366, dtype: float64
# and just to check Feb 29
print cum_data_df.ix[60]
>> mean 343
>> size 1
>> Name: 60, dtype: float64

Groupby month and day is a good solution. However, the perfect thinking of groupby(dayofyear) is still possible if you use xrray.CFtimeIndex instead of pandas.DatetimeIndex. i.e,
Delete feb29 by using
rand_data=rand_data[~((rand_data.index.month==2) & (rand_data.index.day==29))]
Replace the index of the above data by xrray.CFtimeIndex, i.e.,
index = xarray.cftime_range('1950-01-01', '1953-12-31', freq='D', calendar = 'noleap')
index = index[~((index.month==2)&(index.day==29))]
rand_data['time']=index
Now, for both non-leap and leap year, the 60th dayofyear would be March 1st, and the total number of dayofyear would be 365. groupbyyear would be correct to calculate climatological daily mean.

Related

How to slice a pandas DataFrame between two dates (day/month) ignoring the year?

I want to filter a pandas DataFrame with DatetimeIndex for multiple years between the 15th of april and the 16th of september. Afterwards I want to set a value the mask.
I was hoping for a function similar to between_time(), but this doesn't exist.
My actual solution is a loop over the unique years.
Minimal Example
import pandas as pd
df = pd.DataFrame({'target':0}, index=pd.date_range('2020-01-01', '2022-01-01', freq='H'))
start_date = "04-15"
end_date = "09-16"
for year in df.index.year.unique():
# normal approche
# df[f'{year}-{start_date}':f'{year}-{end_date}'] = 1
# similar approche slightly faster
df.iloc[df.index.get_loc(f'{year}-{start_date}'):df.index.get_loc(f'{year}-{end_date}')+1]=1
Does a solution exist where I can avoid the loop and maybe improve the performance?
To get the dates between April 1st and October 31st, what about using the month?
df.loc[df.index.month.isin(range(4, 10)), 'target'] == 1
If you want to map any date/time, just ignoring the year, you can replace the year to 2000 (leap year) and use:
s = pd.to_datetime(df.index.strftime('2000-%m-%d'))
df.loc[(s >= '2000-04-15') & (s <= '2020-09-16'), 'target'] = 1

pandas groupby to return dates in a dataset

Could someone give me a tip on how to use pandas groupby to find similar "days" in a time series dataset?
For example my data is (averaged daily values) a buildings electrical power and weather data, I am attempting to see if Pandas groupby can be used to find similar "days" both in electrical power usage and weather to a unique date in the time stamp of July 25th 2019.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/bbartling/Data/master/stackoverflow_groupby_question.csv', parse_dates=True)
df['Date']=pd.to_datetime(df['Date'], utc=True)
df.set_index('Date', inplace=True)
df_daily_avg = df.resample('D').mean()
What I am trying to find is like the top 10 or 15 most similar days in this dataset to the averaged temperature on that day of July 25th which is:
july_25_temp_avg = df_daily_avg.loc['2019-07-25'].Temperature_C
22.047916666666676
And averaged building power which is:
july_25_power_avg = df_daily_avg.loc['2019-07-25'].kW
52.658333333333324
If I use groupby, something like this below it strips away the time stamp index.
july25_most_similar = df_daily_avg.groupby(['kW','Temperature_C'],as_index=False).Temperature_C.mean()
returns where it seems like most similar days are on the bottom:
kW Temperature_C
0 9.316667 17.256250
1 9.433333 14.979167
2 9.616667 13.933333
3 9.683333 19.822917
4 10.116667 24.606250
... ... ...
360 58.741667 21.816667
361 61.250000 23.839583
362 61.633333 25.204167
363 62.483333 25.970833
364 63.808333 25.300000
Any tips greatly appreciated to return the timestamp/days that are most similar to July 25th Temperature & Power.
Also if it is possible to use more criteria than just Temperature_C is it possible to post an additional answer to use more weather data? For example the averaged power on July 25th and more weather data (beyond just Temperature_C) like Wind_Speed_m_s Relative_Humidity Temperature_C Pressure_mbar DHI_DNI?
I think I would take this approach:
indx = df_daily_avg.sub(df_daily_avg.loc['2019-07-25']).abs()\
.sort_values(['Temperature_C', 'kW']).head(10).index.normalize()
df[df.index.normalize().isin(indx)]
Use diff and take the abs get the top then days sorted on 'Temperature_C' and 'kW' or some sort of metric that ranks the two.
Then get those index normalize them to a date and determine which rows in the original dataframe match retreived index.

How to aggregate irregularly sampled data for Time Series Analysis

I am trying to forecast daily profit using time series analysis, but daily profit is not only recorded unevenly, but some of the data is missing.
Raw Data:
Date
Revenue
2020/1/19
10$
2020/1/20
7$
2020/1/25
14$
2020/1/29
18$
2020/2/1
12$
2020/2/2
17$
2020/2/9
28$
The above table is an example of what kind of data I have. Profit is not recorded daily, so date between 2020/1/20 and 2020/1/24 does not exist. Not only that, say the profit recorded during the period between 2020/2/3 and 2020/3/8 went missing in the database. I would like to recover this missing data and use time series analysis to predict the profit after 2020/2/9 ~.
My approach was to first aggregate the profit every 6 days since I have to recover the profit between 2020/2/3 and 2020/3/8. So my cleaned data will look something like this
Date
Revenue
2020/1/16 ~ 2020/1/21
17$
2020/1/22 ~ 2020/1/27
14$
2020/1/28 ~ 2020/2/2
47$
2020/2/3 ~ 2020/2/8
? (to predict)
After applying this to a time series model, I would like to further predict the profit after 2020/2/9 ~.
This is my general idea, but as a beginner at Python, using pandas library, I have trouble executing my ideas. Could you please help me how to aggregate the profit every 6 days and have the data look like the above table?
Easiest way is using pandas resample function.
Provided you have an index of type Datetime resampling to aggregate profits at every 6 days would be as simple as your_dataframe.resample('6D').sum()
You can do all sorts of resampling (end of month, end of quarter, begining of week, every hour, minute, second, ...). Check the full documentation if you're interested: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html?highlight=resample#pandas.DataFrame.resample
I suggest using a combination of .rolling, pd.date_range, and .reindex
Say your DataFrame is df, with proper datetime indexing:
df = pd.DataFrame([['2020/1/19',10],
['2020/1/20',7],
['2020/1/25',14],
['2020/1/29',18],
['2020/2/1',12],
['2020/2/2',17],
['2020/2/9',28]],columns=['Date','Revenue'])
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date',inplace=True)
The first step is to 'fill in' the missing days with dummy, zero revenue. We can use pd.date_range to get an index with evenly spaced dates from 2020/1/16 to 2020/2/8, and then .reindex to bring this into the main df DataFrame:
evenly_spaced_idx = pd.date_range(start='2020/1/16',end='2020/2/8',freq='1d')
df = df.reindex(evenly_spaced_idx, fill_value=0)
Now we can take a rolling sum for each 6 day period. We're not interested in every day's six day revenue total, only every 6th days, though:
summary_df = df.rolling('6d').sum().iloc[5::6, :]
The last thing with summary_df is just to format it the way you'd like so that it clearly states the date range which each row refers to.
summary_df['Start Date'] = summary_df.index-pd.Timedelta('6d')
summary_df['End Date'] = summary_df.index
summary_df.reset_index(drop=True,inplace=True)
You can use resample for this.
Make sure to have the "Date" column as datetime type.
>>> df = pd.DataFrame([["2020/1/19" ,10],
... ["2020/1/20" ,7],
... ["2020/1/25" ,14],
... ["2020/1/29" ,18],
... ["2020/2/1" ,12],
... ["2020/2/2" ,17],
... ["2020/2/9" ,28]], columns=['Date', 'Revenue'])
>>> df['Date'] = pd.to_datetime(df.Date)
For pandas < 1.1.0
>>> df.set_index('Date').resample('6D', base=3).sum()
Revenue
Date
2020-01-16 17
2020-01-22 14
2020-01-28 47
2020-02-03 0
2020-02-09 28
For pandas >= 1.1.0
>>> df.set_index('Date').resample('6D', origin='2020-01-16').sum()
Revenue
Date
2020-01-16 17
2020-01-22 14
2020-01-28 47
2020-02-03 0
2020-02-09 28

How to use resample count to screen original data

Say I have a dataset at daily scale, but not all days have valid data. In other words, some days are missing in the data. I want to compute the summer season mean from the dataset, and want to remove the month which has less than 20 days of valid data.
How do I achieve this (in pythonic fashion)?
Say my dataframe (df) is like this:
DATE VAR
1900-01-01 123
1900-01-02 456
1900-01-10 789
...
I know how to compute the count:
df_count = df.resample('MS').count()
I also know how to compute the summer season mean:
df_summer = df.resample('Q-NOV').mean()
You can based on df_count to filter out the month which have less than 20 days of valid data. After that compute the summer season mean using your formula.
df_count = df.resample('MS').count()
relevant_month = df_count[df_count > 10].index
df_summer = df[df.index.isin(relevant_month)].resample('Q-NOV').mean()
I suppose you store the month in index. If the month or time is stored in a different column, change df.index.isin(relevant_month) to df.columnName.isin(relevant_month).
I also don't know the format of your time column (date or datetime) so you might need to modify the code to change this part df.index.isin(relevant_month) accordingly. It is just the general idea.

Dataset statistics with custom begin of the year

I would like to do some annual statistics (cumulative sum) on an daily time series of data in an xarray dataset. The tricky part is that the day on which my considered year begins must be flexible and the time series contains leap years.
I tried e.g. the following:
rollday = -181
dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})
foo_groups = foo.roll(time=rollday).groupby(foo.time.dt.year)
foo_cumsum = foo_groups.apply(lambda x: x.cumsum(dim='time', skipna=True))
which is "unfavorable" mainly because of two things:
(1) the rolling doesn't account for the leap years, so the get an offset of one day per leap year and
(2) the beginning of the first year (until end of June) is appended to the end of the rolled time series, which creates some "fake year" where the cumulative sums doesn't make sense anymore.
I tried also to first cut off the ends of the time series, but then the rolling doesn't work anymore. Resampling to me also did not seem to be an option, as I could not find a fitting pandas freq string.
I'm sure there is a better/correct way to do this. Can somebody help?
You can use a xarray.DataArray that specifies the groups. One way to do this is to create an array of values (years) that define the group ids:
# setup sample data
dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})
# create an array of years (modify day/month for your use case)
my_years = xr.DataArray([t.year if ((t.month < 9) or ((t.month==9) and (t.day < 15))) else (t.year + 1) for t in foo.indexes['time']],
dims='time', name='my_years', coords={'time': dr})
# use that array of years (integers) to do the groupby
foo_cumsum = foo.groupby(my_years).apply(lambda x: x.cumsum(dim='time', skipna=True))
# Voila!
foo_cumsum['data'].plot()

Categories