How to calculate Cumulative Average Revenue in Python?

I want to create a graph that displays the cumulative average revenue for each 'Year Onboarded' (the year of a customer's first transaction) over time, but I am making mistakes when grouping the information I need.
Toy Data:
import pandas as pd

dataset = {'ClientId': [1,2,3,1,2,3,1,2,3,1,2,3,4,4,4,4,4,4,4],
           'Year Onboarded': [2018,2019,2020,2018,2019,2020,2018,2019,2020,2018,2019,2020,2016,2016,2016,2016,2016,2016,2016],
           'Year': [2019,2019,2020,2019,2019,2020,2018,2020,2020,2020,2019,2020,2016,2017,2018,2019,2020,2017,2018],
           'Revenue': [100,50,25,30,40,50,60,100,20,40,100,20,5,5,8,4,10,20,8]}
df = pd.DataFrame(data=dataset)
Explanation: customers have a designated 'Year Onboarded' and make a transaction in every 'Year' listed.
I then calculate the years that have elapsed since each client onboarded, to make the graph visually more appealing.
df['Yearsdiff'] = df['Year']-df['Year Onboarded']
To calculate the Cumulative Average Revenue I tried the following methods:
First try:
df = df.join(df.groupby(['Year']).expanding().agg({'Revenue': 'mean'})
               .reset_index(level=0, drop=True)
               .add_suffix('_roll'))
df.groupby(['Year Onboarded', 'Year']).last().drop(columns=['Revenue'])
The output starts to be cumulative but the last row isn't cumulative anymore (not sure why).
Second try:
import numpy as np

df.groupby(['Year Onboarded','Year']).agg('mean') \
  .groupby(level=[1]) \
  .agg({'Revenue': np.cumsum})
But it doesn't work properly; I tried other ways as well but didn't achieve good results.
To visualize the cumulative average revenue I simply use sns.lineplot
My goal is to get a graph similar to the one below, but for that I first need to group my data correctly.
[Expected output plot]
The Years that we can see on the graph represent the 'Year Onboarded' not the 'Year'.
Can someone help me calculate a Cumulative Average Revenue that works in order to plot a graph similar to the one above? Thank you
Also, the data provided in the toy dataset will surely not give something exactly like the example plot, but the idea should be there.

This is how I would do it. Since the toy data is not the same as yours, some changes may be needed, but all in all:
import seaborn as sns

df1 = df.copy()
df1['Yearsdiff'] = df1['Year'] - df1['Year Onboarded']
# Find the average revenue per Year Onboarded
df1['Revenue'] = df1.groupby(['Year Onboarded'])['Revenue'].transform('mean')
# Calculate the cumulative sum of Revenue (which is now the average per Year Onboarded)
# per Yearsdiff, because this will be our x-axis in the plot
df1['Revenue'] = df1.groupby(['Yearsdiff'])['Revenue'].transform('cumsum')
# Finally plot the data, using the column 'Year' as hue to account for the different years
sns.lineplot(x=df1['Yearsdiff'], y=df1['Revenue'], hue=df1['Year'])
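If you instead want the cumulative average of the raw transactions per onboarding cohort, here is a sketch of an alternative (not part of the original answer; df2 and cum_avg_revenue are names I made up):
# Sketch: expanding (cumulative) mean of Revenue per 'Year Onboarded',
# ordered by years elapsed since onboarding
df2 = df.sort_values(['Year Onboarded', 'Yearsdiff'])
df2['cum_avg_revenue'] = (df2.groupby('Year Onboarded')['Revenue']
                             .expanding().mean()
                             .reset_index(level=0, drop=True))
sns.lineplot(data=df2, x='Yearsdiff', y='cum_avg_revenue', hue='Year Onboarded')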

You can create a rolling mean like this:
df['rolling_mean'] = df.groupby(['Year Onboarded'])['Revenue'].apply(lambda x: x.rolling(10, 1).mean())
df
# ClientId Year Onboarded Year Revenue rolling_mean
# 0 1 2018 2019 100 100.000000
# 1 2 2019 2019 50 50.000000
# 2 3 2020 2020 25 25.000000
# 3 1 2018 2019 30 65.000000
# 4 2 2019 2019 40 45.000000
# 5 3 2020 2020 50 37.500000
# 6 1 2018 2018 60 63.333333
# 7 2 2019 2020 100 63.333333
# 8 3 2020 2020 20 31.666667
# 9 1 2018 2020 40 57.500000
# 10 2 2019 2019 100 72.500000
# 11 3 2020 2020 20 28.750000
# 12 4 2016 2016 5 5.000000
# 13 4 2016 2017 5 5.000000
# 14 4 2016 2018 8 6.000000
# 15 4 2016 2019 4 5.500000
# 16 4 2016 2020 10 6.400000
# 17 4 2016 2017 20 8.666667
# 18 4 2016 2018 8 8.571429
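Note that rolling(10, 1) means a window of 10 rows with min_periods=1; since no client here has more than 10 transactions, it behaves like an expanding (cumulative) mean. An equivalent sketch using expanding() directly:
# Sketch: explicit expanding mean, equivalent for groups of at most 10 rows
df['rolling_mean'] = (df.groupby('Year Onboarded')['Revenue']
                        .expanding().mean()
                        .reset_index(level=0, drop=True))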

Related

How to get years without starting with df = df.set_index('Year')

I have this dataframe:
I can obtain the values that are 15% greater than the mean with:
df[df['Interest']>(df["Interest"].mean()*1.15)].Interest.to_string()
This gives me all the Interest values that are 15% greater than the mean in their respective categories.
The question is: how do I get the year in which these values occurred, without first calling
df = df.set_index('Year')
since that approach makes me look the year values up with df.iloc?
Use .loc:
>>> df
Year Dividends Interest Other Types Rent Royalties Trade Income
0 2007 7632014 4643033 206207 626668 89715 18654926
1 2008 6718487 4220161 379049 735494 58535 29677697
2 2009 1226858 5682198 482776 1015181 138083 22712088
3 2010 978925 2229315 565625 1260765 146791 15219378
4 2011 1500621 2452712 675770 1325025 244073 19697549
5 2012 308064 2346778 591180 1483543 378998 33030888
6 2013 275019 4274425 707344 1664747 296136 17503798
7 2014 226634 3124281 891466 1807172 443671 16023363
8 2015 2171559 3474825 1144862 1858838 585733 16778858
9 2016 767713 4646350 2616322 1942102 458543 13970498
10 2017 759016 4918320 1659303 2001220 796343 9730659
11 2018 687308 6057191 1524474 2127583 1224471 19570540
>>> df.loc[df['Interest']>(df["Interest"].mean()*1.15), ['Year', 'Interest']]
Year Interest
0 2007 4643033
2 2009 5682198
9 2016 4646350
10 2017 4918320
11 2018 6057191
This will return a DataFrame with the Year and Interest values that match your condition:
df[df['Interest']>(df["Interest"].mean()*1.15)][['Year', 'Interest']]
This will return just the Year column:
df.loc[df["Interest"] > df["Interest"].mean()*1.15, "Year"]

I have multiIndexes for my dataframe, how do I calculate the sum for one level?

Hi everyone, I want to calculate the sum of the Violent_type counts per year. For example, the total count of Violent_type for year 2013 would be 18728 + 121662 + 1035. But I don't know how to select the data when there is a MultiIndex. Any advice will be appreciated. Thanks.
The level argument in pandas.DataFrame.groupby() is what you are looking for.
level int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
To answer your question, you only need:
df.groupby(level=[0, 1]).sum()
# or
df.groupby(level=['district', 'year']).sum()
To see the effect:
import pandas as pd
iterables = [['001', 'SST'], [2013, 2014], ['Dangerous', 'Non-Violent', 'Violent']]
index = pd.MultiIndex.from_product(iterables, names=['district', 'year', 'Violent_type'])
df = pd.DataFrame(list(range(0, len(index))), index=index, columns=['count'])
'''
print(df)
count
district year Violent_type
001 2013 Dangerous 0
Non-Violent 1
Violent 2
2014 Dangerous 3
Non-Violent 4
Violent 5
SST 2013 Dangerous 6
Non-Violent 7
Violent 8
2014 Dangerous 9
Non-Violent 10
Violent 11
'''
print(df.groupby(level=[0, 1]).sum())
'''
count
district year
001 2013 3
2014 12
SST 2013 21
2014 30
'''
print(df.groupby(level=['district', 'year']).sum())
'''
count
district year
001 2013 3
2014 12
SST 2013 21
2014 30
'''
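If you instead want the total per year across all districts, group by just that one level (a sketch based on the same toy frame):
print(df.groupby(level='year').sum())
'''
      count
year
2013     24
2014     42
'''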

Pandas DataFrame: Calculate percentage difference between rows?

I have a year-wise dataframe with three columns: year, type and value. I'm trying to calculate the percentage of taken vs. empty. For example, year 2014 has 50 empty and 50 taken out of a total of 100, so 50% empty and 50% taken, as shown in final_df.
df
year type value
0 2014 Total 100
1 2014 Empty 50
2 2014 Taken 50
3 2013 Total 2000
4 2013 Empty 100
5 2013 Taken 1900
6 2012 Total 50
7 2012 Empty 45
8 2012 Taken 5
Final df
year Empty Taken
0 2014 50 50
0 2013 ... ...
0 2012 ... ...
Should I shift cells up and do the percentage calculation, or is there a better method?
You can use pivot_table:
new = df[df['type'] != 'Total']
res = (new.pivot_table(index='year',columns='type',values='value').sort_values(by='year',ascending=False).reset_index())
which gets you:
res
year Empty Taken
0 2014 50 50
1 2013 100 1900
2 2012 45 5
And then you can get the percentages for each column:
total = (res['Empty'] + res['Taken'])
for col in ['Empty','Taken']:
    res[col+'_perc'] = res[col] / total
year Empty Taken Empty_perc Taken_perc
2014 50 50 0.50 0.50
2013 100 1900 0.05 0.95
2012 45 5 0.90 0.10
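Instead of the loop, the same thing can be written as one vectorized division; a sketch using div with axis=0, which divides each column by the per-row total:
# Sketch: compute both percentage columns at once
perc = res[['Empty', 'Taken']].div(total, axis=0).add_suffix('_perc')
res = res.join(perc)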
As @sophods pointed out, you can use pivot_table to rearrange your dataframe. However, to add to that answer: I think you're after the percentage, hence I suggest you keep the 'Total' record and then apply your calculation:
#pivot your data
res = (df.pivot_table(index='year',columns='type',values='value')).reset_index()
#calculate percentages of empty and taken
res['Empty'] = res['Empty']/res['Total']
res['Taken'] = res['Taken']/res['Total']
#final dataframe
res = res[['year', 'Empty', 'Taken']]
You can filter out the records having 'Empty' and 'Taken' in type, then group by year and apply func. In func, set type as the index and compute the percentages from the required values. The x passed to func is a dataframe holding one group's type and value columns.
def func(x):
    x = x.set_index('type')
    total = x['value'].sum()
    return [(x.loc['Empty', 'value']/total)*100, (x.loc['Taken', 'value']/total)*100]
temp = (df[df['type'].isin({'Empty', 'Taken'})]
.groupby('year')[['type', 'value']]
.apply(lambda x: func(x)))
temp
year
2012 [90.0, 10.0]
2013 [5.0, 95.0]
2014 [50.0, 50.0]
dtype: object
Convert the result into the required dataframe:
pd.DataFrame(temp.values.tolist(), index=temp.index, columns=['Empty', 'Taken'])
Empty Taken
year
2012 90.0 10.0
2013 5.0 95.0
2014 50.0 50.0
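For what it's worth, pd.crosstab with normalize='index' can produce the same row-wise fractions in a single call; a sketch (values come out as fractions rather than the *100 percentages above):
# Sketch: row-normalized pivot in one call
new = df[df['type'] != 'Total']
pd.crosstab(index=new['year'], columns=new['type'],
            values=new['value'], aggfunc='sum', normalize='index')
# type   Empty  Taken
# year
# 2012    0.90   0.10
# 2013    0.05   0.95
# 2014    0.50   0.50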

Pandas groupby operations between groups

I have a DataFrame with 4 fields: Location, Year, Week and Sales. I would like to know the difference in Sales between two years while preserving the granularity of the dataset, i.e. for each Location, Year and Week, the difference from the same week of another year.
The following will generate a Dataframe with a similar structure:
import numpy as np
import pandas as pd

raw_data = {'Location': ['A']*30 + ['B']*30 + ['C']*30,
            'Year': 3*([2018]*10 + [2019]*10 + [2020]*10),
            'Week': 3*(3*list(range(1, 11))),
            'Sales': np.random.randint(100, size=90)
            }
df = pd.DataFrame(raw_data)
Location Year Week Sales
A 2018 1 67
A 2018 2 93
A 2018 … 67
A 2019 1 49
A 2019 2 38
A 2019 … 40
B 2018 1 18
… … … …
Could you please show me what would be the best approach?
Thank you very much
You can do it using groupby and shift:
df["Next_Years_Sales"] = df.groupby(["Location", "Week"])["Sales"].shift(-1)
df["YoY_Sales_Difference"] = df["Next_Years_Sales"] - df["Sales"]
Spot checking it:
df[(df["Location"] == "A") & (df["Week"] == 1)]
Out[37]:
Location Year Week Sales Next_Years_Sales YoY_Sales_Difference
0 A 2018 1 99 10.0 -89.0
10 A 2019 1 10 3.0 -7.0
20 A 2020 1 3 NaN NaN
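Note that shift(-1) relies on the rows within each (Location, Week) group already being ordered by Year. If that ordering is not guaranteed, a sketch that sorts first:
# Sketch: make the year ordering explicit before shifting
df = df.sort_values(['Location', 'Week', 'Year'])
df['Next_Years_Sales'] = df.groupby(['Location', 'Week'])['Sales'].shift(-1)
df['YoY_Sales_Difference'] = df['Next_Years_Sales'] - df['Sales']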

How to calculate average and most frequent values per group?

I have the following df:
df =
year intensity category
2015 22 1
2015 21 1
2015 23 2
2016 25 2
2017 20 1
2017 21 1
2017 20 3
I need to group by year and calculate an average intensity and a most frequent category(per year).
I know that it's possible to calculate the most frequent category as follows:
df.groupby('year')['category'].agg(lambda x: x.value_counts().index[0])
I also know how to calculate average intensity:
df = df.groupby(["year"]).agg({'intensity':'mean'}).reset_index()
But I don't know how to put everything together without a join operation.
Use agg with a dictionary to define how to aggregate each column.
df.groupby('year', as_index=False)[['category', 'intensity']]\
.agg({'category': lambda x: pd.Series.mode(x)[0], 'intensity':'mean'})
Output:
year category intensity
0 2015 1 22.000000
1 2016 2 25.000000
2 2017 1 20.333333
Or you can still use a lambda function:
df.groupby('year', as_index=False)[['category','intensity']]\
.agg({'category': lambda x: x.value_counts().index[0],'intensity':'mean'})
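On pandas 0.25+, named aggregation gives the same result with a plainer syntax; a sketch (the mode lambda picks the first mode on ties):
# Sketch: named aggregation
df.groupby('year', as_index=False).agg(
    category=('category', lambda x: x.mode()[0]),
    intensity=('intensity', 'mean'),
)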
