I'm working with a dataframe that holds daily measured data for several variables across 30 years. I am trying to group by day of the year and then take the mean across the 30 years. How do I go about this?
I tried to group by day after checking the type of the YYYYMMDD column (it is int64). So far this has only added new Day, Month and Year columns to the dataframe.
I'm a bit stuck on how to calculate the means from here. I would need to somehow group all the Jan 1sts, Jan 2nds, etc. over the 30 years and average them afterwards.
You can groupby with month and day:
df.index = pd.to_datetime(df.index)
( df.groupby([df.index.month, df.index.day]).mean().reset_index().
rename({'level_0':'month', 'level_1':'day'}, axis=1))
or, if you want the rows numbered consecutively in calendar order instead of being indexed by month and day, set as_index=False:
df.groupby([df.index.month, df.index.day], as_index=False).mean()
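Since the question mentions that YYYYMMDD is stored as int64, here is a minimal sketch of the whole path from that column to the per-day means. The column name `temp` and the sample values are made up for illustration; the explicit `format="%Y%m%d"` is there because integers passed to `pd.to_datetime` without a format are not read as calendar dates.

```python
import pandas as pd

# Toy data: two years, two days each (column names are assumptions).
df = pd.DataFrame({
    "YYYYMMDD": [19900101, 19900102, 19910101, 19910102],
    "temp": [1.0, 2.0, 3.0, 4.0],
})

# Parse the integer dates explicitly and use them as the index.
dates = pd.to_datetime(df["YYYYMMDD"].astype(str), format="%Y%m%d")
df = df.drop(columns="YYYYMMDD").set_index(dates)

# Name the two grouping keys so the result has readable index levels.
month = df.index.month.rename("month")
day = df.index.day.rename("day")

# One row per (month, day), averaged over all years.
daily_mean = df.groupby([month, day]).mean()
print(daily_mean)
```

Calling `daily_mean.reset_index()` afterwards turns the two index levels into ordinary `month` and `day` columns.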
Assuming that I have a series made of daily values:
dates = pd.date_range('1/1/2004', periods=365, freq="D")
ts = pd.Series(np.random.randint(0,101, 365), index=dates)
I need to use .groupby or .reduce with a fixed schema of dates.
Using ts.resample('8d') isn't an option: the chunk boundaries must fall on the same days in every month, and the last chunk of each month has to be flexible enough to handle the different month lengths, including February in a leap year.
A list of dates can be obtained through:
g = dates[dates.day.isin([1,8,16,24])]
How can I group or reduce my data to this specific schema, so that I can compute the sum, max and min in a more elegant and efficient way than:
for i in range(len(g) - 1):
    # select the values that fall between two consecutive boundary dates
    ts.loc[(g[i] < ts.index) & (ts.index < g[i + 1])]
Well, from a calendar point of view, you can group them by calendar week, day of week, month and so on.
If that is something you would be interested in, you could do it easily with pandas, for example:
import pandas as pd
df['week'] = df['date'].dt.isocalendar().week  # ISO calendar week (replaces the older .dt.week)
df.groupby('week')['values'].sum()  # sum values by week
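For the fixed schema in the question (chunks starting on the 1st, 8th, 16th and 24th of each month), one possible sketch is to map every day of the month to the start day of its chunk and group by year, month and that start day. The names `starts`, `chunk_start` and `summary` are only illustrative, and the series is filled with dummy values.

```python
import numpy as np
import pandas as pd

# 2004 is a leap year, so use 366 days to cover it completely.
dates = pd.date_range("2004-01-01", periods=366, freq="D")
ts = pd.Series(np.arange(len(dates)), index=dates)

# Chunk boundaries inside every month.
starts = np.array([1, 8, 16, 24])

# For each day of month, find the latest boundary that is <= that day.
day_of_month = np.asarray(ts.index.day)
chunk_start = starts[np.searchsorted(starts, day_of_month, side="right") - 1]

# The last chunk of a month simply holds whatever days remain (5 to 8 days).
keys = [ts.index.year, ts.index.month, chunk_start]
summary = ts.groupby(keys).agg(["sum", "max", "min", "count"])
print(summary.head(8))
```

Because the last chunk is defined only by its start day, month length and leap years need no special handling.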
I have a dataframe, and these are the first 5 index values. There are several rows with different data points for each date before the index moves on to the next day:
DatetimeIndex(['2014-01-01', '2014-01-01', '2014-01-01', '2014-01-01',
'2014-01-01'],
dtype='datetime64[ns]', name='DayStartedOn', freq=None)
and this is the current column dtypes
country object
type object
name object
injection float64
withdrawal float64
cy_month period[M]
I wish to add a column with the calendar year and month, and 2 more columns with different fiscal years and months.
It would be better to keep year and month in separate columns, i.e. calendar year, calendar month, fiscal year and fiscal month. The objective is to keep these column values when I regroup or resample together with other columns.
I achieved above cy_month by
df['cy_month']=df.index.to_period('M')
although I'm not entirely comfortable with this, as I want the period, not the month end.
I tried to add these 2 columns
for calendar year:
pd.Period(df_storage_clean.index.year, freq='A-DEC')
for another fiscal year:
pd.Period(df_storage_clean.index.year, freq='A-SEP')
but had Traceback:
ValueError: Value must be Period, string, integer, or datetime
So I started doing it without pandas, looping row by row and appending to a list,
lst_period_cy = []
for y in lst_cy:
    period_cy = pd.Period(y, freq='A-DEC')
    lst_period_cy.append(period_cy)
and then converting the list to a Series or dataframe and adding it back to the df,
but I suppose that is not efficient (150k rows of data), so I haven't continued.
Just in case you haven't found a solution yet ...
You could do the following:
df.reset_index(drop=False, inplace=True)
df['cal_year_month'] = df.DayStartedOn.dt.month
df['cal_year'] = df.DayStartedOn.dt.year
df['fisc_year'] = df.DayStartedOn.apply(pd.Period, freq='A-SEP')
df.set_index('DayStartedOn', drop=True, inplace=True)
My assumption is that, as in your example, the index is named DayStartedOn. If that's not the case then the code has to be adjusted accordingly.
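The question also asks for a separate fiscal month, and with 150k rows a vectorized version may be preferable to apply. Here is a minimal sketch using plain integer arithmetic on the index, assuming a fiscal year that ends in September and is labelled by the calendar year in which it ends (the same labelling a 'A-SEP' period uses). The variable `fy_start` and the column names are my own choices.

```python
import pandas as pd

idx = pd.DatetimeIndex(
    ["2014-01-01", "2014-09-30", "2014-10-01", "2014-12-31"],
    name="DayStartedOn",
)
df = pd.DataFrame({"injection": [1.0, 2.0, 3.0, 4.0]}, index=idx)

fy_start = 10  # assumed: fiscal year runs October to September

df["cal_year"] = df.index.year
df["cal_month"] = df.index.month

# Months from October onwards already belong to the next fiscal year.
df["fisc_year"] = df.index.year + (df.index.month >= fy_start).astype(int)

# October becomes fiscal month 1, September becomes fiscal month 12.
df["fisc_month"] = (df.index.month - fy_start) % 12 + 1
print(df)
```

All four columns are ordinary integers, so they survive a later groupby or can be used as grouping keys themselves.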
Say I have a dataset at daily resolution, but not all days have valid data. In other words, some days are missing. I want to compute the summer season mean from the dataset, and I want to exclude any month that has fewer than 20 days of valid data.
How do I achieve this (in pythonic fashion)?
Say my dataframe (df) is like this:
DATE VAR
1900-01-01 123
1900-01-02 456
1900-01-10 789
...
I know how to compute the count:
df_count = df.resample('MS').count()
I also know how to compute the summer season mean:
df_summer = df.resample('Q-NOV').mean()
You can use df_count to filter out the months that have fewer than 20 days of valid data, and then compute the summer season mean with your formula. Compare months as periods, because the index of df_count holds only the first day of each month and would otherwise match a single daily row per month:
df_count = df.resample('MS').count()
relevant_month = df_count[df_count['VAR'] >= 20].index.to_period('M')
df_summer = df[df.index.to_period('M').isin(relevant_month)].resample('Q-NOV').mean()
I am assuming the dates are stored in a DatetimeIndex and the data column is called VAR, as in your example. If the dates are in a regular column instead, replace df.index.to_period('M') with df.columnName.dt.to_period('M').
I also don't know the exact type of your time column (date or datetime), so you may need to convert it with pd.to_datetime first. This is just the general idea.
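To make the filtering step concrete, here is a small self-contained sketch on made-up data for one summer, with most of July missing. It uses groupby on the monthly period instead of resample, which is my own substitution, and it averages the remaining daily values directly rather than averaging monthly means.

```python
import numpy as np
import pandas as pd

# One summer of daily data; June values are 1.0, August values are 3.0.
dates = pd.date_range("1900-06-01", "1900-08-31", freq="D")
df = pd.DataFrame({"VAR": np.ones(len(dates))}, index=dates)
df.loc[df.index.month == 8, "VAR"] = 3.0

# Simulate missing data: keep only the first 10 days of July.
df = df[~((df.index.month == 7) & (df.index.day > 10))]

# Count valid days per calendar month and broadcast the count to every row.
month = df.index.to_period("M")
valid_days = df.groupby(month)["VAR"].transform("count")

# Keep only rows whose month has at least 20 valid days.
kept = df[valid_days >= 20]

# Summer (June to August) mean per year over the remaining days.
summer = kept[kept.index.month.isin([6, 7, 8])]
summer_mean = summer.groupby(summer.index.year)["VAR"].mean()
print(summer_mean)
```

July is dropped entirely here, so the 1900 mean is taken over the 30 June days and the 31 August days only.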
I have a dataframe in which one column is a datetime series and another column holds some data for every date. The years range from 2005 to 2014. I want to group the same calendar dates from each year together, i.e. all the 1st Januaries in that range must be grouped together irrespective of the year, and similarly for every other day of the year. So I should get 365 days as the output. How can I do that?
Assuming your DataFrame has a column Date, you can make it the index of the DataFrame and then use strftime, to convert to a format with only day and month (like "%m-%d"), and groupby plus the appropriate function (I just used mean):
df = df.set_index('Date')
df.index = df.index.strftime("%m-%d")
dfAggregated = df.groupby(level=0).mean()
Please note that the output will have 366 days, due to leap years. You might want to filter out the data associated to Feb 29th or merge it into Feb 28th/March 1st (depending on the specific use case of your application)
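If dropping Feb 29th is the right choice for your use case, a minimal sketch of that variant could look like this. The `value` column and the generated data are placeholders; only the `Date` column name comes from the answer above.

```python
import pandas as pd

# Ten years of daily data; 2008 and 2012 are leap years.
dates = pd.date_range("2005-01-01", "2014-12-31", freq="D")
df = pd.DataFrame({"Date": dates, "value": range(len(dates))})

# Remove Feb 29th so every year contributes the same 365 days.
is_leap_day = (df["Date"].dt.month == 2) & (df["Date"].dt.day == 29)
no_leap = df[~is_leap_day]

# Group by a month-day label and average across years.
key = no_leap["Date"].dt.strftime("%m-%d")
daily_mean = no_leap.groupby(key)["value"].mean()
print(len(daily_mean))
```

The zero-padded "%m-%d" labels sort in calendar order, so the result runs from 01-01 to 12-31.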
I have looked at the resample/Timegrouper functionality in Pandas. However, I'm trying to figure out how to use it for this specific case. I want to do a seasonal analysis across a financial asset - let's say S&P 500. I want to know how the asset performs between any two custom dates on average across many years.
Example: If I have a 10 year history of daily changes of S&P 500 and I pick the date range between March 13th and March 23rd, then I want to know the average change for each date in my range across the last 10 years - i.e. average change on 3/13 each year for the last 10 years, and then for 3/14, 3/15 and so on until 3/23. This means I need to groupby month and day and do an average of values across different years.
I can probably do this by creating 3 different columns for year, month, and day and then grouping by two of them, but I wonder if there are more elegant ways of doing this.
I figured it out. It turned out to be pretty simple and I was just being dumb.
x.groupby([x.index.month, x.index.day], as_index=True).mean()
where x is a pandas series in my case (but I suppose could also be a dataframe?). This will return a multi-index series which is ok in my case, but if it's not in your case then you can manipulate it to drop a level or turn the index into new columns
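To pull out a custom window such as March 13th to March 23rd from that multi-index result, one option is to slice it with (month, day) tuples. This is a sketch on synthetic data; the series `x` here is filled with dummy numbers rather than real S&P 500 changes, and the level names are my own.

```python
import numpy as np
import pandas as pd

# Ten years of synthetic daily values standing in for daily changes.
idx = pd.date_range("2010-01-01", "2019-12-31", freq="D")
x = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# Average by (month, day) across all years, with named index levels.
month = x.index.month.rename("month")
day = x.index.day.rename("day")
by_day = x.groupby([month, day]).mean()

# Label-based slicing on the sorted MultiIndex includes both end points.
window = by_day.loc[(3, 13):(3, 23)]
print(window)
```

The slice relies on the groupby result being sorted by month and day, which is the default behaviour of groupby.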