Given this dataframe df:
date type target
2021-01-01 0 5
2021-01-01 0 6
2021-01-01 1 4
2021-01-01 1 2
2021-01-02 0 5
2021-01-02 1 3
2021-01-02 1 7
2021-01-02 0 1
2021-01-03 0 2
2021-01-03 1 5
I want to create a new column that contains yesterday's target mean by type.
For example, for the 5th row (date=2021-01-02, type=0), the new column's value would be 5.5, since the mean of target for the previous day, 2021-01-01, with type=0 is (5+6)/2.
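For reference, a minimal construction of this sample frame (dates kept as strings here):
import pandas as pd
df = pd.DataFrame({
    'date': ['2021-01-01'] * 4 + ['2021-01-02'] * 4 + ['2021-01-03'] * 2,
    'type': [0, 0, 1, 1, 0, 1, 1, 0, 0, 1],
    'target': [5, 6, 4, 2, 5, 3, 7, 1, 2, 5],
})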
I can easily obtain the mean of target by grouping by date and type:
means = df.groupby(['date', 'type'])['target'].mean()
But I don't know how to create a new column on the original dataframe with the desired data, which should look as follows:
date type target mean
2021-01-01 0 5 NaN (or null or whatever)
2021-01-01 0 6 NaN
2021-01-01 1 4 NaN
2021-01-01 1 2 NaN
2021-01-02 0 5 5.5
2021-01-02 1 3 3
2021-01-02 1 7 3
2021-01-02 0 1 5.5
2021-01-03 0 2 3
2021-01-03 1 5 5
Ensure your date column is a datetime, and add a temporary column to df holding the previous day's date:
df['date'] = pd.to_datetime(df['date'])
df['yesterday'] = df['date'] - pd.Timedelta('1 day')
Then use your means groupby, with as_index=False, and left merge that onto the original df on yesterday/date and type columns, and select the desired columns:
means = df.groupby(['date', 'type'], as_index=False)['target'].mean()
df.merge(means, left_on=['yesterday', 'type'], right_on=['date', 'type'],
         how='left', suffixes=[None, ' mean'])[['date', 'type', 'target', 'target mean']]
Output:
date type target target mean
0 2021-01-01 0 5 NaN
1 2021-01-01 0 6 NaN
2 2021-01-01 1 4 NaN
3 2021-01-01 1 2 NaN
4 2021-01-02 0 5 5.5
5 2021-01-02 1 3 3.0
6 2021-01-02 1 7 3.0
7 2021-01-02 0 1 5.5
8 2021-01-03 0 2 3.0
9 2021-01-03 1 5 5.0
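If you prefer explicit names over the suffixes argument, a minimal variant of the same merge (the 'mean' column name and the renamed key are my own choices, not part of the answer above): rename the keys in means up front so no suffix handling is needed, then drop the helper column:
means = (df.groupby(['date', 'type'], as_index=False)['target'].mean()
           .rename(columns={'date': 'yesterday', 'target': 'mean'}))
df = df.merge(means, on=['yesterday', 'type'], how='left').drop(columns='yesterday')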
The idea is to add one day to the first level of the MultiIndex Series with Timedelta, which makes it possible to add the new column with DataFrame.join:
df['date'] = pd.to_datetime(df['date'])
s1 = df.groupby(['date', 'type'])['target'].mean()
s2 = s1.rename(index=lambda x: x + pd.Timedelta(days=1), level=0)
df = df.join(s2.rename('mean'), on=['date','type'])
print (df)
date type target mean
0 2021-01-01 0 5 NaN
1 2021-01-01 0 6 NaN
2 2021-01-01 1 4 NaN
3 2021-01-01 1 2 NaN
4 2021-01-02 0 5 5.5
5 2021-01-02 1 3 3.0
6 2021-01-02 1 7 3.0
7 2021-01-02 0 1 5.5
8 2021-01-03 0 2 3.0
9 2021-01-03 1 5 5.0
Another solution:
df['date'] = pd.to_datetime(df['date'])
s1 = df.groupby([df['date'] + pd.Timedelta(days=1), 'type'])['target'].mean()
df = df.join(s1.rename('mean'), on=['date','type'])
print (df)
date type target mean
0 2021-01-01 0 5 NaN
1 2021-01-01 0 6 NaN
2 2021-01-01 1 4 NaN
3 2021-01-01 1 2 NaN
4 2021-01-02 0 5 5.5
5 2021-01-02 1 3 3.0
6 2021-01-02 1 7 3.0
7 2021-01-02 0 1 5.5
8 2021-01-03 0 2 3.0
9 2021-01-03 1 5 5.0
A small variation on @Emi OB's answer:
means = df.groupby(["date", "type"], as_index=False)["target"].mean()
means["mean"] = means.pop("target").shift(2)
df = df.merge(means, how="left", on=["date", "type"])
date type target mean
0 2021-01-01 0 5 NaN
1 2021-01-01 0 6 NaN
2 2021-01-01 1 4 NaN
3 2021-01-01 1 2 NaN
4 2021-01-02 0 5 5.5
5 2021-01-02 1 3 3.0
6 2021-01-02 1 7 3.0
7 2021-01-02 0 1 5.5
8 2021-01-03 0 2 3.0
9 2021-01-03 1 5 5.0
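Note that shift(2) only lines up because every date here contributes exactly one row per type to means (two types per day). A sketch of a variant that shifts within each type group instead, still assuming the dates are consecutive and sorted:
means = df.groupby(['date', 'type'], as_index=False)['target'].mean()
means['mean'] = means.groupby('type')['target'].shift(1)  # previous day's mean, per type
df = df.merge(means.drop(columns='target'), how='left', on=['date', 'type'])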
Related
I want to add a day to all the dates in this dataframe:
value B N S date
date
2020-12-31 1 11 0 2020-12-31
2021-01-01 3 80 0 2021-01-01
2021-01-02 4 99 0 2021-01-02
2021-01-03 3 78 0 2021-01-03
2021-01-04 0 50 0 2021-01-04
to make it like this:
value B N S date
date
2020-12-31 1 11 0 2021-01-01
2021-01-01 3 80 0 2021-01-02
2021-01-02 4 99 0 2021-01-03
2021-01-03 3 78 0 2021-01-04
2021-01-04 0 50 0 2021-01-05
How can I do this?
df['date'] = pd.to_datetime(df['date']).add(pd.offsets.Day(1))
df
value B N S date
0 2020-12-31 1 11 0 2021-01-01
1 2021-01-01 3 80 0 2021-01-02
2 2021-01-02 4 99 0 2021-01-03
3 2021-01-03 3 78 0 2021-01-04
4 2021-01-04 0 50 0 2021-01-05
You can temporarily convert to datetime to add a DateOffset:
df['date'] = (pd.to_datetime(df['date'])
.add(pd.DateOffset(days=1))
.dt.strftime('%Y-%m-%d') # optional
)
Output:
value B N S date
0 2020-12-31 1 11 0 2021-01-01
1 2021-01-01 3 80 0 2021-01-02
2 2021-01-02 4 99 0 2021-01-03
3 2021-01-03 3 78 0 2021-01-04
4 2021-01-04 0 50 0 2021-01-05
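A plain Timedelta also works here if no calendar-aware offset logic is needed; a minimal sketch:
df['date'] = pd.to_datetime(df['date']) + pd.Timedelta(days=1)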
I have data that looks like this (assume start and end are datetimes):
id start end
1 01-01 01-02
1 01-03 01-05
1 01-04 01-07
1 01-06 NaT
1 01-07 NaT
I want to get a dataframe that includes all dates, with a 'cumulative sum' that counts only the ranges each date falls inside.
dates count
01-01 1
01-02 0
01-03 1
01-04 2
01-05 1
01-06 2
01-07 3
One idea I had was to use cumcount on the start dates and a 'reverse cumcount' that decrements using the end dates, but I'm having trouble wrapping my head around doing this in pandas, and I'm wondering whether there's a more elegant solution.
Here are two options. First, consider this data with only one id; note that your start and end columns must be datetime.
d = {'id': [1, 1, 1, 1, 1],
'start': [pd.Timestamp('2021-01-01'), pd.Timestamp('2021-01-03'),
pd.Timestamp('2021-01-04'), pd.Timestamp('2021-01-06'),
pd.Timestamp('2021-01-07')],
'end': [pd.Timestamp('2021-01-02'), pd.Timestamp('2021-01-05'),
pd.Timestamp('2021-01-07'), pd.NaT, pd.NaT]}
df = pd.DataFrame(d)
To get your result, you can sub the get_dummies of end from the get_dummies of start, sum in case several ranges start and/or end on the same date, cumsum along the dates, and reindex to get all the dates between the min and max dates available. Note that a range stops counting on its end date itself (the -1 lands on the end date), which is why 2021-01-07 comes out as 2 rather than 3 below. Wrap it in a function:
def dates_cc(df_):
    return (
        pd.get_dummies(df_['start'], dtype=int)                    # +1 on each start date
        .sub(pd.get_dummies(df_['end'], dtype=int), fill_value=0)  # -1 on each end date
        .sum()                                                     # net change per date
        .cumsum()                                                  # running count of active ranges
        .to_frame(name='count')
        .reindex(pd.date_range(df_['start'].min(), df_['end'].max()), method='ffill')
        .rename_axis('dates')
    )
Now you can apply this function to your dataframe
res = dates_cc(df).reset_index()
print(res)
# dates count
# 0 2021-01-01 1.0
# 1 2021-01-02 0.0
# 2 2021-01-03 1.0
# 3 2021-01-04 2.0
# 4 2021-01-05 1.0
# 5 2021-01-06 2.0
# 6 2021-01-07 2.0
Now if you have several ids, like
df1 = df.assign(id=[1,1,2,2,2])
print(df1)
# id start end
# 0 1 2021-01-01 2021-01-02
# 1 1 2021-01-03 2021-01-05
# 2 2 2021-01-04 2021-01-07
# 3 2 2021-01-06 NaT
# 4 2 2021-01-07 NaT
then you can apply the function per group:
res1 = df1.groupby('id').apply(dates_cc).reset_index()
print(res1)
# id dates count
# 0 1 2021-01-01 1.0
# 1 1 2021-01-02 0.0
# 2 1 2021-01-03 1.0
# 3 1 2021-01-04 1.0
# 4 1 2021-01-05 0.0
# 5 2 2021-01-04 1.0
# 6 2 2021-01-05 1.0
# 7 2 2021-01-06 2.0
# 8 2 2021-01-07 2.0
That said, a more straightforward possibility is crosstab, which creates one row per id; the rest is roughly the same manipulation.
res2 = (
pd.crosstab(index=df1['id'], columns=df1['start'])
.sub(pd.crosstab(index=df1['id'], columns=df1['end']), fill_value=0)
.reindex(columns=pd.date_range(df1['start'].min(), df1['end'].max()), fill_value=0)
.rename_axis(columns='dates')
.cumsum(axis=1)
.stack()
.reset_index(name='count')
)
print(res2)
# id dates count
# 0 1 2021-01-01 1.0
# 1 1 2021-01-02 0.0
# 2 1 2021-01-03 1.0
# 3 1 2021-01-04 1.0
# 4 1 2021-01-05 0.0
# 5 1 2021-01-06 0.0
# 6 1 2021-01-07 0.0
# 7 2 2021-01-01 0.0
# 8 2 2021-01-02 0.0
# 9 2 2021-01-03 0.0
# 10 2 2021-01-04 1.0
# 11 2 2021-01-05 1.0
# 12 2 2021-01-06 2.0
# 13 2 2021-01-07 2.0
The main difference between the two options is that this one creates extra dates for each id: for example, 2021-01-01 appears for id=1 but not id=2, and with this version you also get that date for id=2, whereas the groupby version does not include it.
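If you want the groupby-style behavior from the crosstab result, one possible sketch trims each id back to its own date window (bounds, lo and hi are hypothetical names of my own):
bounds = df1.groupby('id').agg(lo=('start', 'min'), hi=('end', 'max')).reset_index()
res2_trimmed = res2.merge(bounds, on='id')
res2_trimmed = res2_trimmed[res2_trimmed['dates'].between(res2_trimmed['lo'], res2_trimmed['hi'])]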
Below is a script for a simplified version of the df in question:
plan_dates = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                           'start_date': ['2021-01-01', '2021-01-01', '2021-01-03', '2021-01-04', '2021-01-05'],
                           'end_date': ['2021-01-04', '2021-01-03', '2021-01-03', '2021-01-06', '2021-01-08']})
plan_dates
id start_date end_date
0 1 2021-01-01 2021-01-04
1 2 2021-01-01 2021-01-03
2 3 2021-01-03 2021-01-03
3 4 2021-01-04 2021-01-06
4 5 2021-01-05 2021-01-08
I would like to create a new DataFrame with a row for each day where the plan is active, for each id.
INTENDED DF:
id active_days
0 1 2021-01-01
1 1 2021-01-02
2 1 2021-01-03
3 1 2021-01-04
4 2 2021-01-01
5 2 2021-01-02
6 2 2021-01-03
7 3 2021-01-03
8 4 2021-01-04
9 4 2021-01-05
10 4 2021-01-06
11 5 2021-01-05
12 5 2021-01-06
13 5 2021-01-07
14 5 2021-01-08
Any help would be greatly appreciated.
Use:
# first part is the same as https://stackoverflow.com/a/66869805/2901002
plan_dates['start_date'] = pd.to_datetime(plan_dates['start_date'])
plan_dates['end_date'] = pd.to_datetime(plan_dates['end_date']) + pd.Timedelta(1, unit='d')
# repeat each row once per active day, then shift start_date by the row's position within its group
s = plan_dates['end_date'].sub(plan_dates['start_date']).dt.days
df = plan_dates.loc[plan_dates.index.repeat(s)].copy()
counter = df.groupby(level=0).cumcount()
df['start_date'] = df['start_date'].add(pd.to_timedelta(counter, unit='d'))
Then remove the end_date column, rename, and reset to the default index:
df = (df.drop('end_date', axis=1)
.rename(columns={'start_date':'active_days'})
.reset_index(drop=True))
print (df)
id active_days
0 1 2021-01-01
1 1 2021-01-02
2 1 2021-01-03
3 1 2021-01-04
4 2 2021-01-01
5 2 2021-01-02
6 2 2021-01-03
7 3 2021-01-03
8 4 2021-01-04
9 4 2021-01-05
10 4 2021-01-06
11 5 2021-01-05
12 5 2021-01-06
13 5 2021-01-07
14 5 2021-01-08
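For comparison, a shorter (though typically slower) sketch using pd.date_range with explode, applied to the original frame with inclusive end dates (i.e. before the Timedelta(1) adjustment above), assuming both columns are already datetime:
plan_dates['active_days'] = [pd.date_range(s, e) for s, e in
                             zip(plan_dates['start_date'], plan_dates['end_date'])]
out = plan_dates[['id', 'active_days']].explode('active_days', ignore_index=True)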
I am trying to get a cumulative mean in python among different groups.
I have data as follows:
id date value
1 2019-01-01 2
1 2019-01-02 8
1 2019-01-04 3
1 2019-01-08 4
1 2019-01-10 12
1 2019-01-13 6
2 2019-01-01 4
2 2019-01-03 2
2 2019-01-04 3
2 2019-01-06 6
2 2019-01-11 1
The output I'm trying to get looks like this:
id date value cumulative_avg
1 2019-01-01 2 NaN
1 2019-01-02 8 2
1 2019-01-04 3 5
1 2019-01-08 4 4.33
1 2019-01-10 12 4.25
1 2019-01-13 6 5.8
2 2019-01-01 4 NaN
2 2019-01-03 2 4
2 2019-01-04 3 3
2 2019-01-06 6 3
2 2019-01-11 1 3.75
I need the cumulative average to restart with each new id.
I can get a variation of what I'm looking for with a single id; for example, if the data set only had the rows where id == 1, I could use:
df['cumulative_avg'] = df['value'].expanding().mean().shift(1)
When I try to add a groupby, I get an error:
df['cumulative_avg'] = df.groupby('id')['value'].expanding().mean().shift(1)
TypeError: incompatible index of inserted column with frame index
Also tried:
df.set_index(['account'])
ValueError: cannot handle a non-unique multi-index!
The actual data I have has millions of rows and thousands of unique ids. Any help with a speedy/efficient way to do this would be appreciated.
For many groups this will perform better because it ditches the apply. Take the cumsum divided by the cumcount, subtracting off the current value to get the analog of an expanding mean shifted by one. Conveniently, pandas interprets 0/0 as NaN, which handles each group's first row.
gp = df.groupby('id')['value']
df['cum_avg'] = (gp.cumsum() - df['value'])/gp.cumcount()
id date value cum_avg
0 1 2019-01-01 2 NaN
1 1 2019-01-02 8 2.000000
2 1 2019-01-04 3 5.000000
3 1 2019-01-08 4 4.333333
4 1 2019-01-10 12 4.250000
5 1 2019-01-13 6 5.800000
6 2 2019-01-01 4 NaN
7 2 2019-01-03 2 4.000000
8 2 2019-01-04 3 3.000000
9 2 2019-01-06 6 3.000000
10 2 2019-01-11 1 3.750000
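A rough benchmark sketch on synthetic data, if you want to verify the performance claim yourself (uses IPython's %timeit; no timings are quoted here, and the sizes are arbitrary):
import numpy as np
import pandas as pd
big = pd.DataFrame({'id': np.repeat(np.arange(10_000), 5),
                    'value': np.random.default_rng(0).integers(1, 10, 50_000)})
gp = big.groupby('id')['value']
%timeit (gp.cumsum() - big['value']) / gp.cumcount()       # vectorized version
%timeit gp.apply(lambda s: s.expanding().mean().shift(1))  # apply-based version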
After a groupby you can't simply chain methods like this: in your example the shift is no longer applied per group, so you would not get the expected result, and the index of the chained result would not align with the original frame anyway, so the column assignment fails. Instead, run the whole chain inside an apply:
df['cumulative_avg'] = df.groupby('id')['value'].apply(lambda x: x.expanding().mean().shift(1))
print (df)
id date value cumulative_avg
0 1 2019-01-01 2 NaN
1 1 2019-01-02 8 2.000000
2 1 2019-01-04 3 5.000000
3 1 2019-01-08 4 4.333333
4 1 2019-01-10 12 4.250000
5 1 2019-01-13 6 5.800000
6 2 2019-01-01 4 NaN
7 2 2019-01-03 2 4.000000
8 2 2019-01-04 3 3.000000
9 2 2019-01-06 6 3.000000
10 2 2019-01-11 1 3.750000
I have a dataframe that I am trying to calculate the year-to-date average for my value columns. Below is a sample dataframe.
date name values values2
0 2019-01-01 a 1 1
1 2019-02-01 a 3 3
2 2019-03-01 a 2 2
3 2019-04-01 a 6 2
I want to create new columns (values_ytd & values2_ytd) that will average the values from January to the latest period within the same year (April in sample data). I will need to group the data by year & name when calculating the averages. I am looking for an output similar to this.
date name values values2 values2_ytd values_ytd
0 2019-01-01 a 1 1 1 1
1 2019-02-01 a 3 3 2 2
2 2019-03-01 a 2 2 2 2
3 2019-04-01 a 6 2 2 3
I have tried unsuccessfully to use expanding().mean(), but most likely I was doing it wrong. My main dataframe has numerous name categories and many more columns. Here is the code I was attempting to use:
df1.groupby([df1['name'], df1['date'].dt.year], as_index=False).expanding().mean().loc[:, 'values':'values2'].add_suffix('_ytd').reset_index(drop=True,level=0)
but am receiving the following error.
NotImplementedError: ops for Expanding for this dtype datetime64[ns] are not implemented
Note: the code below works perfectly when substituting cumsum() for .expanding().mean() to create a year-to-date sum of the values, but I can't figure it out for averages:
df1.groupby([df1['name'], df1['date'].dt.year], as_index=False).cumsum().loc[:, 'values':'values2'].add_suffix('_ytd').reset_index(drop=True,level=0)
Any help is greatly appreciated.
Try this:
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
df[['values_ytd', 'values2_ytd']] = (df.groupby([df.index.year, 'name'])[['values', 'values2']]
                                       .expanding().mean()
                                       .reset_index(level=[0, 1], drop=True))
df
name values values2 values_ytd values2_ytd
date
2019-01-01 a 1 1 1.0 1.0
2019-02-01 a 3 3 2.0 2.0
2019-03-01 a 2 2 2.0 2.0
2019-04-01 a 6 2 3.0 2.0
Example using multiple names and years:
date name values values2
0 2019-01-01 a 1 1
1 2019-02-01 a 3 3
2 2019-03-01 a 2 2
3 2019-04-01 a 6 2
4 2019-01-01 b 1 4
5 2019-02-01 b 3 4
6 2020-01-01 a 1 1
7 2020-02-01 a 3 3
8 2020-03-01 a 2 2
9 2020-04-01 a 6 2
Output:
name values values2 values_ytd values2_ytd
date
2019-01-01 a 1 1 1.0 1.0
2019-02-01 a 3 3 2.0 2.0
2019-03-01 a 2 2 2.0 2.0
2019-04-01 a 6 2 3.0 2.0
2019-01-01 b 1 4 1.0 4.0
2019-02-01 b 3 4 2.0 4.0
2020-01-01 a 1 1 1.0 1.0
2020-02-01 a 3 3 2.0 2.0
2020-03-01 a 2 2 2.0 2.0
2020-04-01 a 6 2 3.0 2.0
You could also set the date column as the index, df.set_index('date', inplace=True), and then use df.groupby('name').resample('AS').mean(), but note that this gives one yearly mean per group rather than a running year-to-date mean.
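For completeness, a sketch of the same year-to-date mean via the cumsum/cumcount trick from the cumulative-average answer earlier in this thread, assuming date is a datetime column rather than the index:
df['date'] = pd.to_datetime(df['date'])
g = df.groupby([df['date'].dt.year, 'name'])
counter = g.cumcount().add(1)  # 1-based position within each (year, name) group
for col in ['values', 'values2']:
    df[f'{col}_ytd'] = g[col].cumsum() / counter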