Plotting in python using group by and sum - python

I am trying to plot a graph from the data below. I need a graph of Year vs Txns.
Original data, which is dataset1 in the code:
WeekDay Day Month Year Time Txns
1 5 1 2015 3 1
1 5 1 2015 4 4
1 5 1 2015 5 1
1 5 1 2015 7 2
This is the data after group by and sum, which is plot in the code:
Index Txns
(2014, 12) 5786
(2015, 1) 70828
(2015, 2) 63228
(2015, 3) 74320
my code:
plot = dataset1.groupby(['Year', 'Month'])['Txns'].sum()
plot_df = plot.unstack('Month').loc[:, 'Txns']
plot_df.index = pd.PeriodIndex(plot_df.index.tolist(), freq='2015')
plot_df.plot()
I get this error every time:
KeyError: 'the label [Txns] is not in the [columns]'
How can I fix this?

Why not just data.groupby('Year').Txns.sum() if you want to group by year and sum Txns?
Append .plot() to that if you want to plot it, or draw one line per year instead.
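As a minimal sketch of both variants, the snippet below rebuilds a small frame from the monthly totals shown in the question (the column layout is an assumption) and computes the yearly totals and a month-by-year table. It also shows why the KeyError appears: after selecting ['Txns'] and unstacking, the columns are month or year numbers, so there is no 'Txns' label left to pick with .loc. The plotting calls are left as comments so the snippet runs without matplotlib.

```python
import pandas as pd

# Monthly totals copied from the question; the column layout is an assumption.
dataset1 = pd.DataFrame({
    "Year": [2014, 2015, 2015, 2015],
    "Month": [12, 1, 2, 3],
    "Txns": [5786, 70828, 63228, 74320],
})

# One total per year: a Series indexed by Year.
yearly = dataset1.groupby("Year")["Txns"].sum()

# One column per year, one row per month. The columns are year numbers,
# so there is no 'Txns' label left to select with .loc.
by_month = dataset1.groupby(["Year", "Month"])["Txns"].sum().unstack("Year")

# With matplotlib installed:
# yearly.plot(kind="bar")   # Year vs Txns
# by_month.plot()           # one line per year
```

Calling .plot() on a DataFrame draws one line per column, which is why unstacking 'Year' into the columns gives the yearly lines.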

Related

pandas plot every Nth index but always include last index

I have a plot on which I want to label only specific values, so that it stays readable and uncluttered.
In the example below, I want to display values every two years, but I don't want to miss displaying the last value.
df =
Year Total value
0 2011 11.393630
1 2012 11.379185
2 2013 10.722502
3 2014 10.304044
4 2015 9.563496
5 2016 9.048299
6 2017 9.290901
7 2018 9.470320
8 2019 9.533228
9 2020 9.593088
10 2021 9.610742
# Plot
df.plot(x='Year')
# Select every other point, these values will be displayed on the chart
col_tuple = df[['Year','Total value']][::3]
for j,k in col_tuple :
    plt.text(j,k*1.1,'%.2f'%(k))
plt.show()
How do I pick and show the last value as well?
I want to make sure the last value is there irrespective of the range or slice
The simplest way is to define the range/slice in reverse, e.g. [::-3]:
col_tuple = df[['Year', 'Total value']][::-3]
# Year Total value
# 10 2021 9.610742
# 7 2018 9.470320
# 4 2015 9.563496
# 1 2012 11.379185
df.plot('Year')
for x, y in col_tuple.itertuples(index=False):
    plt.text(x, y*1.01, f'{y:.2f}')
If you want to ensure both the last and first index, use Index.union to combine the (forward) sliced index and last index:
idx = df.index[::3].union([df.index[-1]])
col_tuple = df[['Year', 'Total value']].iloc[idx]
# Year Total value
# 0 2011 11.393630
# 3 2014 10.304044
# 6 2017 9.290901
# 9 2020 9.593088
# 10 2021 9.610742
df.plot('Year')
for x, y in col_tuple.itertuples(index=False):
    plt.text(x, y*1.01, f'{y:.2f}')
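The same idea can be wrapped in a small helper that works by position, so it does not depend on the frame having a default 0..n-1 index. This is only a sketch: every_nth_with_last is a hypothetical name, and it assumes the frame is not empty.

```python
import numpy as np
import pandas as pd


def every_nth_with_last(frame, step):
    """Return every `step`-th row plus the last row (hypothetical helper)."""
    # Positions 0, step, 2*step, ... merged with the final position;
    # np.union1d sorts the result and removes duplicates.
    positions = np.union1d(np.arange(0, len(frame), step), [len(frame) - 1])
    return frame.iloc[positions]


# Data from the question.
df = pd.DataFrame({
    "Year": range(2011, 2022),
    "Total value": [11.393630, 11.379185, 10.722502, 10.304044, 9.563496,
                    9.048299, 9.290901, 9.470320, 9.533228, 9.593088, 9.610742],
})

labels = every_nth_with_last(df, 3)
```

The resulting labels frame can be passed to the same itertuples loop to place the text annotations.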

How to create a column whose values are based on the values of another column?

I have a df like this:
Year 2016 2017
Month
1 0.979000 1.109000
2 0.974500 1.085667
3 1.004000 1.075667
4 1.027333 1.184000
5 1.049000 1.089000
6 1.013250 1.085500
7 0.999000 1.059000
8 0.996667 1.104000
9 1.024000 1.121333
10 1.019000 1.126333
11 0.949000 1.183000
12 1.074000 1.203000
How can I add a 'Season' column that is populated with "Spring", "Summer", etc. based on the numerical value of the month? E.g. months 12, 1, and 2 = Winter, and so on.
You could use np.select with pd.Series.between:
import numpy as np
df["Season"] = np.select([df["Month"].between(3, 5),
                          df["Month"].between(6, 8),
                          df["Month"].between(9, 11)],
                         ["Spring", "Summer", "Fall"],
                         "Winter")
Month 2016 2017 Season
0 1 0.979000 1.109000 Winter
1 2 0.974500 1.085667 Winter
2 3 1.004000 1.075667 Spring
3 4 1.027333 1.184000 Spring
4 5 1.049000 1.089000 Spring
5 6 1.013250 1.085500 Summer
6 7 0.999000 1.059000 Summer
7 8 0.996667 1.104000 Summer
8 9 1.024000 1.121333 Fall
9 10 1.019000 1.126333 Fall
10 11 0.949000 1.183000 Fall
11 12 1.074000 1.203000 Winter
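A different technique that gives the same labels is integer arithmetic plus a lookup, sketched below. It assumes Month is an ordinary column (in the question it is the index, so df.reset_index() would come first); the SEASONS list is a hypothetical helper introduced only for this illustration.

```python
import pandas as pd

# Stand-in frame with Month as an ordinary column (an assumption; in the
# question Month is the index, so df.reset_index() would come first).
df = pd.DataFrame({"Month": range(1, 13)})

# Hypothetical helper list: position 0 covers months 12, 1 and 2, and so on.
SEASONS = ["Winter", "Spring", "Summer", "Fall"]

# month % 12 sends 12 to 0, so // 3 yields 0 for 12, 1, 2; 1 for 3-5;
# 2 for 6-8; and 3 for 9-11. The dict maps those codes to season names.
df["Season"] = ((df["Month"] % 12) // 3).map(dict(enumerate(SEASONS)))
```

This avoids listing each condition separately, at the cost of being less explicit than np.select about which months belong to which season.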
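You could also iterate through the Month column, appending each season name to a list that you then add as a new column. Test membership with in, because a condition written as month == 12 or 1 or 2 is always true in Python.
seasons = []  # plain list; it becomes the new column afterwards
for month in df['Month']:
    if month in (12, 1, 2):
        seasons.append("Winter")
Then add your other conditions with elif and else branches, and assign the finished list to your main df afterwards with df['Season'] = seasons. Let me know if this helps.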

How to calculate Cumulative Average Revenue ? Python

I want to create a graph that will display the cumulative average revenue for each 'Year Onboarded' (first customer transaction) over a period of time. But I am making mistakes when grouping the information I need.
Toy Data:
dataset = {'ClientId': [1,2,3,1,2,3,1,2,3,1,2,3,4,4,4,4,4,4,4],
'Year Onboarded': [2018,2019,2020,2018,2019,2020,2018,2019,2020,2018,2019,2020,2016,2016,2016,2016,2016,2016,2016],
'Year': [2019,2019,2020,2019,2019,2020,2018,2020,2020,2020,2019,2020,2016,2017,2018,2019,2020,2017,2018],
'Revenue': [100,50,25,30,40,50,60,100,20,40,100,20,5,5,8,4,10,20,8]}
df = pd.DataFrame(data=dataset)
Explanation: Customers have a designated 'Year Onboarded' and they make a transaction every 'Year' mentioned.
Then I calculate the years that have elapsed since the clients onboarded in order to make my graph visually more appealing.
df['Yearsdiff'] = df['Year']-df['Year Onboarded']
To calculate the Cumulative Average Revenue I tried the following methods:
First try:
df = df.join(df.groupby(['Year']).expanding().agg({ 'Revenue': 'mean'})
.reset_index(level=0, drop=True)
.add_suffix('_roll'))
df.groupby(['Year Onboarded', 'Year']).last().drop(columns=['Revenue'])
The output starts to be cumulative but the last row isn't cumulative anymore (not sure why).
Second Try:
df.groupby(['Year Onboarded','Year']).agg('mean') \
.groupby(level=[1]) \
.agg({'Revenue':np.cumsum})
But it doesn't work properly, I tried other ways as well but didn't achieve good results.
To visualize the cumulative average revenue I simply use sns.lineplot
My goal is to get a graph similar to the one below, but for that I first need to group my data correctly.
Expected output plot
The Years that we can see on the graph represent the 'Year Onboarded' not the 'Year'.
Can someone help me calculate a Cumulative Average Revenue that works in order to plot a graph similar to the one above? Thank you
Also the data provided in the toy dataset will surely not give something similar to the example plot but the idea should be there.
This is how I would do it. Since the toy data differs from the real data, some changes will probably be needed, but the general approach is:
import seaborn as sns
df1 = df.copy()
df1['Yearsdiff'] = df1['Year']-df1['Year Onboarded']
df1['Revenue'] = df.groupby(['Year Onboarded'])['Revenue'].transform('mean')
#Find the average revenue per Year Onboarded
df1['Revenue'] = df1.groupby(['Yearsdiff'])['Revenue'].transform('cumsum')
#Calculate the cumulative sum of Revenue (Which is now the average per Year Onboarded) per Yearsdiff (because this will be our X-axis in the plot)
sns.lineplot(x=df1['Yearsdiff'],y=df1['Revenue'],hue=df1['Year'])
#Finally plot the data, using the column 'Year' as hue to account for the different years.
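You can create a rolling mean like this (transform keeps the original row index, so the result can be assigned straight back as a column):
df['rolling_mean'] = df.groupby(['Year Onboarded'])['Revenue'].transform(lambda x: x.rolling(10, 1).mean())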
df
# ClientId Year Onboarded Year Revenue rolling_mean
# 0 1 2018 2019 100 100.000000
# 1 2 2019 2019 50 50.000000
# 2 3 2020 2020 25 25.000000
# 3 1 2018 2019 30 65.000000
# 4 2 2019 2019 40 45.000000
# 5 3 2020 2020 50 37.500000
# 6 1 2018 2018 60 63.333333
# 7 2 2019 2020 100 63.333333
# 8 3 2020 2020 20 31.666667
# 9 1 2018 2020 40 57.500000
# 10 2 2019 2019 100 72.500000
# 11 3 2020 2020 20 28.750000
# 12 4 2016 2016 5 5.000000
# 13 4 2016 2017 5 5.000000
# 14 4 2016 2018 8 6.000000
# 15 4 2016 2019 4 5.500000
# 16 4 2016 2020 10 6.400000
# 17 4 2016 2017 20 8.666667
# 18 4 2016 2018 8 8.571429
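If "cumulative average" is read as a running mean of the average revenue per elapsed year within each onboarding cohort (this reading is an assumption, as are the names per_step and CumAvgRevenue), an expanding mean does the job without having to pick a window size. A minimal sketch on the toy data:

```python
import pandas as pd

# Toy data from the question.
dataset = {
    "ClientId": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 4, 4, 4, 4, 4, 4],
    "Year Onboarded": [2018, 2019, 2020, 2018, 2019, 2020, 2018, 2019, 2020,
                       2018, 2019, 2020, 2016, 2016, 2016, 2016, 2016, 2016, 2016],
    "Year": [2019, 2019, 2020, 2019, 2019, 2020, 2018, 2020, 2020, 2020,
             2019, 2020, 2016, 2017, 2018, 2019, 2020, 2017, 2018],
    "Revenue": [100, 50, 25, 30, 40, 50, 60, 100, 20, 40, 100, 20,
                5, 5, 8, 4, 10, 20, 8],
}
df = pd.DataFrame(dataset)
df["Yearsdiff"] = df["Year"] - df["Year Onboarded"]

# Average revenue per cohort and per year elapsed since onboarding.
# groupby sorts its keys, so rows come out ordered by Yearsdiff per cohort.
per_step = df.groupby(["Year Onboarded", "Yearsdiff"], as_index=False)["Revenue"].mean()

# Running (expanding) mean of those averages within each cohort.
per_step["CumAvgRevenue"] = (
    per_step.groupby("Year Onboarded")["Revenue"]
    .transform(lambda s: s.expanding().mean())
)

# With seaborn installed, one line per cohort:
# sns.lineplot(data=per_step, x="Yearsdiff", y="CumAvgRevenue", hue="Year Onboarded")
```

If the intended quantity is instead total revenue so far divided by the number of transactions so far, the first groupby step would need to carry sums and counts rather than means.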

Plotting bar graph by month - matplotlib

I have a dataset that is in the following form:
Date A B C
01/04/2012 2 5 Y
05/04/2012 3 4 Y
06/05/2012 7 6 Y
09/05/2012 8 2 N
11/05/2012 1 4 Y
15/06/2012 5 4 Y
That continues on with more rows.
I want to plot a bar chart with the date on the x-axis converted to show just the month (i.e. April, May, June). On the y-axis I want the average of the sum of the A and B columns, so for April it would be 7 (14 total over two instances) and for May it would be 9.33 (28 total over 3 instances).
I'm really struggling with how to do this and I'd prefer not to create another column that sums up A and B.
You can use groupby on month_name, then mean + eval. Selecting only A and B keeps the text column C out of the mean:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
monthly = df.groupby(df['Date'].dt.month_name(), sort=False)[['A', 'B']].mean().eval('A+B')
monthly.plot(kind='bar')
print(monthly)
Date
April 7.000000
May 9.333333
June 9.000000
dtype: float64
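An equivalent way to get the same numbers, sketched here on the six sample rows, is to sum A and B per row first and then average those row totals per month. No extra column is stored on df; row_totals and monthly_avg are names made up for this illustration.

```python
import pandas as pd

# The six sample rows from the question.
df = pd.DataFrame({
    "Date": ["01/04/2012", "05/04/2012", "06/05/2012",
             "09/05/2012", "11/05/2012", "15/06/2012"],
    "A": [2, 3, 7, 8, 1, 5],
    "B": [5, 4, 6, 2, 4, 4],
    "C": ["Y", "Y", "Y", "N", "Y", "Y"],
})
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)

# Row-wise A + B as a temporary Series (nothing is added to df).
row_totals = df[["A", "B"]].sum(axis=1)

# Average of those totals per month name; sort=False keeps calendar order
# because the rows are already in date order.
monthly_avg = row_totals.groupby(df["Date"].dt.month_name(), sort=False).mean()

# With matplotlib installed:
# monthly_avg.plot(kind="bar")
```

Because the mean of a sum equals the sum of the means, this matches the mean-then-eval result above.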

Plotting for repeated values using loops Python

I have some data that looks like data = pd.read_csv(....):
Year Month HOUR NAME RATE
2010 1 0 Big 222
2010 1 0 Welsch Power 434
2010 1 0 Cottonwood 124
2010 1 1 Big 455
2010 1 1 Welsch Power 900
2010 1 1 Cottonwood 110
.
.
.
2010 2 0 Big 600
2010 2 0 Welsch Power 1000
2010 2 0 Cottonwood 170
.
.
2010 3 0 Big 400
2010 3 0 Welsch Power 900
2010 3 0 Cottonwood 110
As you can see, the HOUR ( 0 - 23 ) repeats itself every Month ( 1 - 12 ). I need a way to loop through the values so I can plot the RATE (Y-Axis) against the HOUR (X-Axis) for every Month, for each NAME.
My attempt looks like:
for name, data in data.groupby('NAME'):
    fig = plt.figure(figsize=(14,23))
    plt.subplot(211)
    plt.plot(data['HOUR'], data['RATE'], label=name)
    plt.xlabel('Hour')
    plt.ylabel('Rate')
    plt.legend()
    plt.show()
    plt.close()
This works, but because HOUR restarts with every change in Month, the line jumps back to 0 each time. I want each of the 12 months as a separate line in a different color, on one graph per NAME, but currently they look like this:
Pivot each group after the groupby so that every month becomes its own column and is plotted as a separate line. Group by the plain string 'NAME' rather than a list so that name is a string and not a tuple:
import matplotlib.pyplot as plt
for name, gp in df.groupby('NAME'):
    fig, ax = plt.subplots()
    gp.pivot(index='HOUR', columns='Month', values='RATE').plot(ax=ax, marker='o', title=name)
    plt.show()
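To see what the pivot produces before plotting, the sketch below rebuilds only the rows visible in the question (the elided rows are left out, so some cells are missing) and stores one wide table per NAME. The dict wide_by_name is a hypothetical name used only here.

```python
import pandas as pd

# Only the rows shown in the question; the elided rows are not filled in.
data = pd.DataFrame({
    "Year": [2010] * 12,
    "Month": [1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "HOUR": [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "NAME": ["Big", "Welsch Power", "Cottonwood"] * 4,
    "RATE": [222, 434, 124, 455, 900, 110, 600, 1000, 170, 400, 900, 110],
})

# One wide table per NAME: rows are HOUR, columns are Month.
wide_by_name = {}
for name, group in data.groupby("NAME"):
    wide_by_name[name] = group.pivot(index="HOUR", columns="Month", values="RATE")

# With matplotlib installed, each column becomes its own colored line:
# wide_by_name["Big"].plot(marker="o", title="Big")
```

Hours that are missing for a month show up as NaN in the wide table, and the plot simply skips them. Note that pivot raises an error if a HOUR and Month pair occurs more than once within a group, for example when several years are present.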
