How to calculate monthly and weekly averages from a dataframe using Python?

Below is my dataframe. How can I calculate both monthly and weekly averages from it in Python? I need to print the month start and end and the week start and end, followed by the average for that month and week.
**Input sample dataset**
kpi_id kpi_name run_date value
1 MTTR 5/17/2021 15
2 MTTR 5/18/2021 16
3 MTTR 5/19/2021 17
4 MTTR 5/20/2021 18
5 MTTR 5/21/2021 19
6 MTTR 5/22/2021 20
7 MTTR 5/23/2021 21
8 MTTR 5/24/2021 22
9 MTTR 5/25/2021 23
10 MTTR 5/26/2021 24
11 MTTR 5/27/2021 25
**expected output**
**monthly_mean**
kpi_name month_start month_end value(mean)
MTTR 5/1/2021 5/31/2021 20
**weekly_mean**
kpi_name week_start week_end value(mean)
MTTR 5/17/2021 5/23/2021 18
MTTR 5/24/2021 5/30/2021 23.5

groupby is your friend:
df['run_date'] = pd.to_datetime(df['run_date'], format='%m/%d/%Y')
monthly = df.groupby(pd.Grouper(key='run_date', freq='M'))['value'].mean()
weekly = df.groupby(pd.Grouper(key='run_date', freq='W'))['value'].mean()
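A minimal runnable sketch of this (column names taken from the sample data above; `run_date` must be datetime, and selecting the `value` column avoids averaging non-numeric columns):

```python
import pandas as pd

df = pd.DataFrame({
    "kpi_name": ["MTTR"] * 11,
    "run_date": pd.date_range("2021-05-17", periods=11, freq="D"),
    "value": range(15, 26),
})

# 'M' bins by calendar month end, 'W' by week ending Sunday
monthly = df.groupby(pd.Grouper(key="run_date", freq="M"))["value"].mean()
weekly = df.groupby(pd.Grouper(key="run_date", freq="W"))["value"].mean()
```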

Extending the answer from igrolvr to match your expected output:
# transformation from string to datetime for groupby
df['run_date'] = pd.to_datetime(df['run_date'], format='%m/%d/%Y')
# monthly
# The Grouper labels each group with the last date of the period;
# reset_index() turns the group-by keys back into columns
monthly = df.groupby(by=['kpi_name', pd.Grouper(key='run_date', freq='M')])['value'].mean().reset_index()
# Getting the start of month
monthly['month_start'] = monthly['run_date'] - pd.offsets.MonthBegin(1)
# Renaming the run_date column to month_end
monthly = monthly.rename({'run_date': 'month_end'}, axis=1)
print(monthly)
# weekly
# The Grouper labels each group with the last date of the period;
# reset_index() turns the group-by keys back into columns
weekly = df.groupby(by=['kpi_name', pd.Grouper(key='run_date', freq='W')])['value'].mean().reset_index()
# The label is the Sunday that ends the week; subtract one week
# and add a day to get the Monday that starts it
weekly['week_start'] = weekly['run_date'] - pd.offsets.Week(1) + pd.offsets.Day(1)
# Renaming the run_date column to week_end
weekly = weekly.rename({'run_date': 'week_end'}, axis=1)
print(weekly)
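Run end-to-end on the sample data, the weekly part of this approach can be checked against the expected output (a sketch; same column names as above):

```python
import pandas as pd

df = pd.DataFrame({
    "kpi_name": ["MTTR"] * 11,
    "run_date": pd.date_range("2021-05-17", periods=11),
    "value": range(15, 26),
})

weekly = (df.groupby(by=["kpi_name", pd.Grouper(key="run_date", freq="W")])["value"]
            .mean().reset_index())
# run_date is the Sunday ending each week; step back a week and
# forward a day to land on the Monday that starts it
weekly["week_start"] = weekly["run_date"] - pd.offsets.Week(1) + pd.offsets.Day(1)
weekly = weekly.rename({"run_date": "week_end"}, axis=1)
```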

Related

How to find the number of days in each month between two date in different years

I'm trying to get the number of days between two dates, but split per month.
I found some answers, but I can't figure out how to do it when the dates span two different years.
For example, I have this dataframe:
import pandas as pd

df = {'Id': ['1','2','3','4','5'],
      'Item': ['A','B','C','D','E'],
      'StartDate': ['2019-12-10', '2019-12-01', '2019-01-01', '2019-05-10', '2019-03-10'],
      'EndDate': ['2020-01-30', '2020-02-02', '2020-03-03', '2020-03-03', '2020-02-02']
     }
df = pd.DataFrame(df, columns=['Id', 'Item', 'StartDate', 'EndDate'])
And I want to get this dataframe:
s = (df[["StartDate", "EndDate"]]
.apply(lambda row: pd.date_range(row.StartDate, row.EndDate), axis=1)
.explode())
new = (s.groupby([s.index, s.dt.year, s.dt.month])
.count()
.unstack(level=[1, 2], fill_value=0))
new.columns = new.columns.map(lambda c: f"{c[0]}-{str(c[1]).zfill(2)}")
new = new.sort_index(axis="columns")
- get all the dates between StartDate and EndDate per row, and explode that list of dates into their own rows
- group by the row id, year and month & count records
- unstack the year & month identifiers to the columns side as a MultiIndex
- join the year & month values with a hyphen in between (also zero-fill months, e.g., 03)
- lastly sort the year-month pairs on columns
to get
>>> new
2019-11 2019-12 2020-01 2020-02 2020-03
0 0 22 30 0 0
1 0 31 31 2 0
2 0 31 31 29 3
3 21 31 31 29 3
4 9 31 31 2 0
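For long date spans, exploding a per-row `date_range` can get large; an alternative sketch (a hypothetical helper, not from the answer above) walks the months with `pd.period_range` and clips each month to the row's interval:

```python
import pandas as pd

def days_per_month(start, end):
    """Count the days of [start, end] falling in each calendar month."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    out = {}
    for p in pd.period_range(start, end, freq="M"):
        lo = max(start, p.start_time)          # clip to month start
        hi = min(end, p.end_time.normalize())  # clip to month end (midnight)
        out[str(p)] = (hi - lo).days + 1
    return out
```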

Find sum of values between two dates of a single date column in Pandas dataframe

The dataframe contains a date column, a revenue column (for that specific date) and the name of the day.
This is the code for creating the df:
pd.DataFrame({'Date':['2015-01-08','2015-01-09','2015-01-10','2015-02-10','2015-08-09','2015-08-13','2015-11-09','2015-11-15'],
'Revenue':[15,4,15,13,16,20,12,9],
'Weekday':['Monday','Tuesday','Wednesday','Monday','Friday','Saturday','Monday','Sunday']})
I want to find the sum of revenue between Mondays:
2015-02-10 34 Monday
2015-11-09 49 Monday etc.
A first idea is to use the Weekday column to form groups: compare against 'Monday', take the cumulative sum, and aggregate per group:
df1 = (df.groupby(df['Weekday'].eq('Monday').cumsum())
.agg({'Date':'first','Revenue':'sum', 'Weekday':'first'}))
print (df1)
Date Revenue Weekday
Weekday
1 2015-01-08 34 Monday
2 2015-02-10 49 Monday
3 2015-11-09 21 Monday
But the Weekday column does not seem to match the actual dates in the sample data, so DataFrame.resample per week starting on Mondays returns a different output:
df['Date'] = pd.to_datetime(df['Date'])
df2 = df.resample('W-Mon', on='Date').agg({'Revenue':'sum', 'Weekday':'first'}).dropna()
print (df2)
Revenue Weekday
Date
2015-01-12 34 Monday
2015-02-16 13 Monday
2015-08-10 16 Friday
2015-08-17 20 Saturday
2015-11-09 12 Monday
2015-11-16 9 Sunday
First convert your Date column from string to datetime type:
df.Date = pd.to_datetime(df.Date)
Then generate the result:
result = df.groupby(pd.Grouper(key='Date', freq='W-MON', label='left')).Revenue.sum() \
    .reset_index()
This result does not contain the day of week, which in my opinion is OK,
as they will all be Mondays.
If you want to see only weeks with non-zero result, you can get it as:
result[result.Revenue != 0]
For your source data the result is:
Date Revenue
0 2015-01-05 34
5 2015-02-09 13
30 2015-08-03 16
31 2015-08-10 20
43 2015-11-02 12
44 2015-11-09 9
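The binning behind `label='left'` can be sketched end-to-end with the question's data (with the default `closed='right'`, each bin is `(Monday, next Monday]` and the left label is the Monday opening it):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-08", "2015-01-09", "2015-01-10", "2015-02-10",
                            "2015-08-09", "2015-08-13", "2015-11-09", "2015-11-15"]),
    "Revenue": [15, 4, 15, 13, 16, 20, 12, 9],
})

result = (df.groupby(pd.Grouper(key="Date", freq="W-MON", label="left"))
            .Revenue.sum()
            .reset_index())
nonzero = result[result.Revenue != 0]
```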

For loop is only storing the last value in column

I am trying to pull the week number given a date and then add that week number to the corresponding row in a pandas/python dataframe.
When I run a for loop it is only storing the last calculated value instead of recording each value.
I've tried .append but haven't been able to get anything to work.
import datetime
from datetime import date
for i in df.index:
    week_number = date(df.year[i], df.month[i], df.day[i]).isocalendar()
    df['week'] = (week_number[1])
Expected values:
day month year week
8 12 2021 49
19 12 2021 50
26 12 2021 51
Values I'm getting:
day month year week
8 12 2021 51
19 12 2021 51
26 12 2021 51
You can simply use Pandas .apply method to make it a one-liner:
df["week"] = df.apply(lambda x: date(x.year, x.month, x.day).isocalendar()[1], axis=1)
You need to assign it back at the corresponding position i. Using .loc should help:
for i in df.index:
    week_number = date(df.year[i], df.month[i], df.day[i]).isocalendar()
    df.loc[i, 'week'] = week_number[1]
prints back:
print(df)
day month year week
0 8 12 2021 49
1 19 12 2021 50
2 26 12 2021 51
You can use this, without using apply which is slow:
df['week'] = pd.to_datetime(df['year'].astype(str) + '-' + df['month'].astype(str) + '-' + df['day'].astype(str)).dt.isocalendar().week
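A variant sketch: `pd.to_datetime` also accepts a frame of `year`/`month`/`day` columns directly, avoiding the string concatenation:

```python
import pandas as pd

df = pd.DataFrame({"day": [8, 19, 26], "month": [12, 12, 12], "year": [2021, 2021, 2021]})
df["week"] = pd.to_datetime(df[["year", "month", "day"]]).dt.isocalendar().week
```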

Monthly climatology across several years, repeated for each day in that month over all years

I need to find the monthly climatology of some data that has daily values across several years. The code below sufficiently summarizes what I am trying to do. monthly_mean holds the averages over all years for specific months. I then need to assign that average in a new column for each day in a specific month over all of the years. For whatever reason, my assignment, df['A Climatology'] = group['A Climatology'], is only assigning values to the month of December. How can I make the assignment happen for all months?
data = np.random.randint(5,30,size=(365*3,3))
df = pd.DataFrame(data, columns=['A', 'B', 'C'], index=pd.date_range('2021-01-01', periods=365*3))
df['A Climatology'] = np.nan
monthly_mean = df['A'].groupby(df.index.month).mean()
for month, group in df.groupby(df.index.month):
    group['A Climatology'] = monthly_mean.loc[month]
    df['A Climatology'] = group['A Climatology']
df
Your code sets the whole column equal to the group, so on every iteration of the loop you overwrite the df's values with only that group's values, which is why your df ends up with only December, the last month in the loop.
monthly_mean = df['A'].groupby(df.index.month).mean()
for month, group in df.groupby(df.index.month):
    df.loc[lambda df: df.index.month == month, 'A Climatology'] = monthly_mean.loc[month]
Instead, you could directly set the df's values where the month == the iterable month.
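A still shorter variant (a sketch with the same column names) skips both the loop and the merge: `groupby(...).transform('mean')` broadcasts each month's mean back onto every row of that month:

```python
import numpy as np
import pandas as pd

data = np.random.randint(5, 30, size=(365 * 3, 3))
df = pd.DataFrame(data, columns=["A", "B", "C"],
                  index=pd.date_range("2021-01-01", periods=365 * 3))

# one value per calendar month, repeated across all rows of that month
df["A Climatology"] = df.groupby(df.index.month)["A"].transform("mean")
```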
merged_df = pd.merge(df,
                     monthly_mean,
                     how='left',
                     left_on=df.index.month,
                     right_on=monthly_mean.index).drop('key_0', axis=1).set_index(df.index)
A_x B C A Climatology A_y
2021-01-01 12 20 18 NaN 16.752688
2021-01-02 24 26 11 NaN 16.752688
2021-01-03 18 27 15 NaN 16.752688
2021-01-04 18 5 22 NaN 16.752688
2021-01-05 10 15 25 NaN 16.752688
... ... ... ... ... ...
2023-12-27 19 15 11 16.11828 16.118280
2023-12-28 16 23 25 16.11828 16.118280
2023-12-29 6 13 16 16.11828 16.118280
2023-12-30 10 9 14 16.11828 16.118280
2023-12-31 15 22 17 16.11828 16.118280
Or to do this without creating a new data frame:
df = df.reset_index().merge(monthly_mean, how='left', left_on=df.index.month, right_on=monthly_mean.index).set_index('index')
monthly_mean:
1 16.752688
2 16.476190
3 16.795699
4 17.111111
5 17.795699
6 18.111111
7 16.806452
8 15.236559
9 15.600000
10 18.279570
11 16.555556
12 16.118280
Name: A, dtype: float64

extract number of days in month from date column, return days to date in current month

I have a pandas dataframe with a date column.
I'm trying to create a function and apply it to the dataframe, creating a column that returns the number of days in the month/year specified.
So far I have:
from calendar import monthrange
def dom(x):
    m = dfs["load_date"].dt.month
    y = dfs["load_date"].dt.year
    monthrange(y, m)
    days = monthrange[1]
    return days
This however does not work when I attempt to apply it to the date column.
Additionally, I would like to be able to identify whether or not it is the current month, and if so return the number of days up to the current date in that month as opposed to days in the entire month.
I am not sure of the best way to do this, all I can think of is to check the month/year against datetime's today and then use a delta
thanks in advance
For pt.1 of your question, you can cast to pd.Period and retrieve days_in_month:
import pandas as pd
# create a sample df:
df = pd.DataFrame({'date': pd.date_range('2020-01', '2021-01', freq='M')})
df['daysinmonths'] = df['date'].apply(lambda t: pd.Period(t, freq='S').days_in_month)
# df['daysinmonths']
# 0 31
# 1 29
# 2 31
# ...
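An even simpler sketch for pt.1: datetime Series expose the month length directly via the `.dt.days_in_month` accessor, so no `apply` is needed:

```python
import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2020-01", "2021-01", freq="M")})
df["daysinmonths"] = df["date"].dt.days_in_month
```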
For pt.2, you can take the timestamp of 'now' and create a boolean mask for your date column, i.e. where its year/month is less than "now". Then calculate the cumsum of the daysinmonth column for the section where the mask returns True. Invert the order of that series to get the days until now.
now = pd.Timestamp('now')
m = (df['date'].dt.year <= now.year) & (df['date'].dt.month < now.month)
df['daysuntilnow'] = df['daysinmonths'][m].cumsum().iloc[::-1].reset_index(drop=True)
Update after comment: to get the elapsed days per month, you can do
df['dayselapsed'] = df['daysinmonths']
m = (df['date'].dt.year == now.year) & (df['date'].dt.month == now.month)
if m.any():
    df.loc[m, 'dayselapsed'] = now.day
df.loc[(df['date'].dt.year >= now.year) & (df['date'].dt.month > now.month), 'dayselapsed'] = 0
output
df
Out[13]:
date daysinmonths daysuntilnow dayselapsed
0 2020-01-31 31 213.0 31
1 2020-02-29 29 182.0 29
2 2020-03-31 31 152.0 31
3 2020-04-30 30 121.0 30
4 2020-05-31 31 91.0 31
5 2020-06-30 30 60.0 30
6 2020-07-31 31 31.0 31
7 2020-08-31 31 NaN 27
8 2020-09-30 30 NaN 0
9 2020-10-31 31 NaN 0
10 2020-11-30 30 NaN 0
11 2020-12-31 31 NaN 0
