I have a dataframe containing the columns Code, Year and No_of_dues. I want to draw bar plots with the year on the x-axis and the number of dues for each year on the y-axis, with one subplot per code, placed one after another. Please help.
Sample data is given below.
Code Year No_of_dues
1 2016 100
1 2017 200
1 2018 300
2 2016 200
2 2017 300
2 2018 500
3 2016 600
3 2017 800
3 2018
Try this one:
df.groupby(['Code', 'Year'])['No_of_dues'].sum().to_frame().plot.bar()
Just use seaborn.
Set your x and y axes, and set hue to the column you want to group by.
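A minimal sketch of the seaborn suggestion, assuming the sample data is loaded into a dataframe called df (that name is an assumption). Instead of hue, it uses row="Code" in sns.catplot, which gives one bar subplot per code stacked vertically. The 2018 value for code 3 is missing in the sample, so it is left as None and dropped rather than guessed; the height and aspect values are arbitrary.

```python
import pandas as pd
import seaborn as sns

# Sample data from the question. The 2018 value for code 3 is missing there,
# so it stays None here instead of being filled in.
df = pd.DataFrame({
    "Code": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Year": [2016, 2017, 2018] * 3,
    "No_of_dues": [100, 200, 300, 200, 300, 500, 600, 800, None],
})

# Drop the incomplete row, then draw one bar subplot per code (one row each).
g = sns.catplot(
    data=df.dropna(subset=["No_of_dues"]),
    x="Year",
    y="No_of_dues",
    row="Code",
    kind="bar",
    height=2.5,  # arbitrary size of each subplot, in inches
    aspect=2,    # arbitrary width/height ratio
)
g.set_axis_labels("Year", "No_of_dues")
```

Swapping row="Code" for hue="Code" would instead put all codes side by side in a single plot.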
I have this pandas data frame, and I want to make a line plot with one line per year:
year month canasta
0 2011 1 239.816531
1 2011 2 239.092353
2 2011 3 239.332308
3 2011 4 237.591538
4 2011 5 238.384231
... ... ... ...
59 2015 12 295.578605
60 2016 1 296.918861
61 2016 2 296.398701
62 2016 3 296.488780
63 2016 4 300.922927
And I tried this code:
dca.groupby(['year', 'month'])['canasta'].mean().reset_index().plot()
But I get this result:
I must be doing something wrong. Please, could you help me with this plot? The x axis is the months, and there should be a line per each year.
Why: After reset_index, year and month become ordinary columns, and some_df.plot() simply plots every numeric column of the dataframe against the index in one plot, which produces what you posted.
Fix: Try unstack instead of reset_index:
(dca.groupby(['year', 'month'])
['canasta'].mean()
.unstack('year').plot()
)
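To see why unstack helps, here is a small sketch that uses only the rows visible in the question (the elided rows are not reconstructed). After unstack('year'), the index is the month and each year becomes its own column, so plot() draws one line per year.

```python
import pandas as pd

# Only the rows shown in the question; the rows hidden behind "..." are omitted.
dca = pd.DataFrame({
    "year": [2011] * 5 + [2016] * 4,
    "month": [1, 2, 3, 4, 5, 1, 2, 3, 4],
    "canasta": [239.816531, 239.092353, 239.332308, 237.591538, 238.384231,
                296.918861, 296.398701, 296.488780, 300.922927],
})

# Rows are months, columns are years: the shape DataFrame.plot() expects
# for "one line per column".
wide = dca.groupby(["year", "month"])["canasta"].mean().unstack("year")
print(wide)

ax = wide.plot(marker="o")
ax.set_xlabel("month")
ax.set_ylabel("canasta")
```

Months that are missing for a given year simply show up as NaN in that year's column and leave a gap in its line.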
I have a large dataset of certain events for my research industry, organized in a dataframe as follows. Each event has an event type (str), a year of the event (int), an event size (int) and an event location (str).
An example dataframe is structured below, with event types 'A', 'B', 'C', or 'D' and event locations 'CA', 'TX', 'NY'.
Event Number  Event Type  Year  Size  Location
1             A           2014  1000  CA
2             B           2014  1000  TX
3             C           2014  456   CA
4             C           2014  675   NY
5             B           2014  567   TX
6             A           2014  765   CA
7             C           2014  1000  NY
8             B           2014  675   TX
9             D           2015  3424  NY
10            A           2015  567   TX
11            A           2015  435   CA
12            C           2016  45    CA
Now, I want to plot a heatmap of event type vs year. i.e., a heatmap with year on the x axis, event type on the y-axis, and a heat color representing a count of how many of those types of events happened in that year. The resulting matrix for the above table would look something like this:
Event Type  2014  2015  2016
A           2     2     0
B           3     0     0
C           3     0     1
D           0     1     0
I have looked into using seaborn but I am not sure how to approach this sort of 2D histogram.
How would I go about it if I also wanted to plot a heatmap of location vs event type (2 strings)?
Thanks!
seaborn.histplot can produce a bivariate plot and understands categorical variables, so:
import pandas as pd
import seaborn as sns
# Read the table copied from the question
df = pd.read_clipboard()
ax = sns.histplot(data=df, x="Event Type", y="Location", cbar=True)
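For the event type vs year heatmap with explicit counts, one alternative sketch (not the approach of the answer above) is to build the count matrix with pd.crosstab and pass it to sns.heatmap. The dataframe below is retyped from the example table in the question; the Size column is left out because it is not needed for counting.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Example events from the question (event numbers 1 to 12, in order).
df = pd.DataFrame({
    "Event Type": list("ABCCBACBDAAC"),
    "Year": [2014] * 8 + [2015] * 3 + [2016],
    "Location": ["CA", "TX", "CA", "NY", "TX", "CA",
                 "NY", "TX", "NY", "TX", "CA", "CA"],
})

# Count matrices: rows and columns are the category values, cells are counts.
type_by_year = pd.crosstab(df["Event Type"], df["Year"])
location_by_type = pd.crosstab(df["Location"], df["Event Type"])

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.heatmap(type_by_year, annot=True, fmt="d", ax=axes[0])
sns.heatmap(location_by_type, annot=True, fmt="d", ax=axes[1])
fig.tight_layout()
```

Combinations that never occur appear as 0 in the crosstab, which matches the zeros in the expected matrix.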
I want to create a graph that will display the cumulative average revenue for each 'Year Onboarded' (first customer transaction) over a period of time. But I am making mistakes when grouping the information I need.
Toy Data:
dataset = {'ClientId': [1,2,3,1,2,3,1,2,3,1,2,3,4,4,4,4,4,4,4],
'Year Onboarded': [2018,2019,2020,2018,2019,2020,2018,2019,2020,2018,2019,2020,2016,2016,2016,2016,2016,2016,2016],
'Year': [2019,2019,2020,2019,2019,2020,2018,2020,2020,2020,2019,2020,2016,2017,2018,2019,2020,2017,2018],
'Revenue': [100,50,25,30,40,50,60,100,20,40,100,20,5,5,8,4,10,20,8]}
df = pd.DataFrame(data=dataset)
Explanation: Customers have a designated 'Year Onboarded' and they make a transaction every 'Year' mentioned.
Then I calculate the years that have elapsed since the clients onboarded in order to make my graph visually more appealing.
df['Yearsdiff'] = df['Year']-df['Year Onboarded']
To calculate the Cumulative Average Revenue I tried the following methods:
First try:
df = df.join(df.groupby(['Year']).expanding().agg({ 'Revenue': 'mean'})
.reset_index(level=0, drop=True)
.add_suffix('_roll'))
df.groupby(['Year Onboarded', 'Year']).last().drop(columns=['Revenue'])
The output starts to be cumulative but the last row isn't cumulative anymore (not sure why).
Second Try:
df.groupby(['Year Onboarded','Year']).agg('mean') \
.groupby(level=[1]) \
.agg({'Revenue':np.cumsum})
But it doesn't work properly. I tried other ways as well but didn't achieve good results.
To visualize the cumulative average revenue I simply use sns.lineplot.
My goal is to get a graph similar to the one below, but for that I first need to group my data correctly.
Expected output plot
The Years that we can see on the graph represent the 'Year Onboarded' not the 'Year'.
Can someone help me calculate a Cumulative Average Revenue that works in order to plot a graph similar to the one above? Thank you
Also the data provided in the toy dataset will surely not give something similar to the example plot but the idea should be there.
This is how I would do it. Since the toy data is not the same as your real data, some changes will probably be needed, but all in all:
import seaborn as sns
df1 = df.copy()
df1['Yearsdiff'] = df1['Year'] - df1['Year Onboarded']
# Find the average revenue per Year Onboarded
df1['Revenue'] = df1.groupby(['Year Onboarded'])['Revenue'].transform('mean')
# Calculate the cumulative sum of Revenue (which is now the average per Year Onboarded) per Yearsdiff, because Yearsdiff will be the x-axis of the plot
df1['Revenue'] = df1.groupby(['Yearsdiff'])['Revenue'].transform('cumsum')
# Finally, plot the data, using the column 'Year' as hue to account for the different years
sns.lineplot(x=df1['Yearsdiff'], y=df1['Revenue'], hue=df1['Year'])
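Another possible reading of "cumulative average revenue" is: for each 'Year Onboarded' cohort, the total revenue up to a given number of elapsed years divided by the number of transactions up to that point. The sketch below implements that interpretation on the toy data; the column name cum_avg_revenue is made up for illustration, and pandas plotting stands in for sns.lineplot.

```python
import pandas as pd

dataset = {
    "ClientId": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 4, 4, 4, 4, 4, 4],
    "Year Onboarded": [2018, 2019, 2020, 2018, 2019, 2020, 2018, 2019, 2020,
                       2018, 2019, 2020, 2016, 2016, 2016, 2016, 2016, 2016, 2016],
    "Year": [2019, 2019, 2020, 2019, 2019, 2020, 2018, 2020, 2020,
             2020, 2019, 2020, 2016, 2017, 2018, 2019, 2020, 2017, 2018],
    "Revenue": [100, 50, 25, 30, 40, 50, 60, 100, 20,
                40, 100, 20, 5, 5, 8, 4, 10, 20, 8],
}
df = pd.DataFrame(dataset)
df["Yearsdiff"] = df["Year"] - df["Year Onboarded"]

# Revenue total and transaction count per cohort and elapsed year.
agg = (df.groupby(["Year Onboarded", "Yearsdiff"])["Revenue"]
         .agg(["sum", "count"])
         .reset_index())

# Running totals within each cohort, then running total / running count.
running = agg.groupby("Year Onboarded")[["sum", "count"]].cumsum()
agg["cum_avg_revenue"] = running["sum"] / running["count"]  # hypothetical column name

# One line per cohort, elapsed years on the x-axis.
wide = agg.pivot(index="Yearsdiff", columns="Year Onboarded", values="cum_avg_revenue")
ax = wide.plot(marker="o")
ax.set_ylabel("Cumulative average revenue")
```

Aggregating sums and counts first keeps the running average weighted by the number of transactions, which a cumulative sum of per-year means would not do.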
You can create a rolling mean like this:
df['rolling_mean'] = df.groupby(['Year Onboarded'])['Revenue'].transform(lambda x: x.rolling(10, 1).mean())
df
# ClientId Year Onboarded Year Revenue rolling_mean
# 0 1 2018 2019 100 100.000000
# 1 2 2019 2019 50 50.000000
# 2 3 2020 2020 25 25.000000
# 3 1 2018 2019 30 65.000000
# 4 2 2019 2019 40 45.000000
# 5 3 2020 2020 50 37.500000
# 6 1 2018 2018 60 63.333333
# 7 2 2019 2020 100 63.333333
# 8 3 2020 2020 20 31.666667
# 9 1 2018 2020 40 57.500000
# 10 2 2019 2019 100 72.500000
# 11 3 2020 2020 20 28.750000
# 12 4 2016 2016 5 5.000000
# 13 4 2016 2017 5 5.000000
# 14 4 2016 2018 8 6.000000
# 15 4 2016 2019 4 5.500000
# 16 4 2016 2020 10 6.400000
# 17 4 2016 2017 20 8.666667
# 18 4 2016 2018 8 8.571429
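Because no cohort in the toy data has more than 10 rows, a window of 10 with min_periods=1 behaves like a running (expanding) mean over the rows in their original order. A short sketch checking that on the same toy data; the column name expanding_mean is an assumption for illustration:

```python
import pandas as pd

dataset = {
    "ClientId": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 4, 4, 4, 4, 4, 4],
    "Year Onboarded": [2018, 2019, 2020, 2018, 2019, 2020, 2018, 2019, 2020,
                       2018, 2019, 2020, 2016, 2016, 2016, 2016, 2016, 2016, 2016],
    "Year": [2019, 2019, 2020, 2019, 2019, 2020, 2018, 2020, 2020,
             2020, 2019, 2020, 2016, 2017, 2018, 2019, 2020, 2017, 2018],
    "Revenue": [100, 50, 25, 30, 40, 50, 60, 100, 20,
                40, 100, 20, 5, 5, 8, 4, 10, 20, 8],
}
df = pd.DataFrame(dataset)

grouped = df.groupby("Year Onboarded")["Revenue"]
# Window of 10 rows, at least 1 observation required.
df["rolling_mean"] = grouped.transform(lambda s: s.rolling(10, min_periods=1).mean())
# Mean of all rows seen so far within the cohort.
df["expanding_mean"] = grouped.transform(lambda s: s.expanding().mean())

print((df["rolling_mean"] - df["expanding_mean"]).abs().max())
```

With real data where a cohort has more than 10 rows the two columns would diverge, and the rows would need sorting by year first for the running mean to follow time.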
I am trying to plot a graph using the data below. I need a graph of Year vs Txns.
Original data, which is dataset1 in the code:
WeekDay Day Month Year Time Txns
1 5 1 2015 3 1
1 5 1 2015 4 4
1 5 1 2015 5 1
1 5 1 2015 7 2
This is the data after the group-by and sum, which is plot in the code:
Index Txns
(2014, 12) 5786
(2015, 1) 70828
(2015, 2) 63228
(2015, 3) 74320
my code:
plot = dataset1.groupby(['Year', 'Month'])['Txns'].sum()
plot_df = plot.unstack('Month').loc[:, 'Txns']
plot_df.index = pd.PeriodIndex(plot_df.index.tolist(), freq='2015')
plot_df.plot()
I get this error every time:
KeyError: 'the label [Txns] is not in the [columns]'
How can I fix this?
Why not just data.groupby('Year').Txns.sum() if you want to group by year and sum Txns?
and .plot() if you wanted to plot it:
or yearly lines:
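A sketch of both variants, using the monthly totals shown in the question as stand-in data (the raw transactions are not reproduced). As for the KeyError: selecting ['Txns'] before sum() yields a Series, so after unstack('Month') the columns are the month numbers and no column named 'Txns' exists for .loc to find. Reading "yearly lines" as one line per year with the month on the x-axis is an assumption.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Monthly totals copied from the grouped output in the question.
dataset1 = pd.DataFrame({
    "Year": [2014, 2015, 2015, 2015],
    "Month": [12, 1, 2, 3],
    "Txns": [5786, 70828, 63228, 74320],
})

fig, (ax_year, ax_month) = plt.subplots(1, 2, figsize=(10, 4))

# Total Txns per year, as a bar chart.
yearly = dataset1.groupby("Year")["Txns"].sum()
yearly.plot(kind="bar", ax=ax_year)
ax_year.set_ylabel("Txns")

# One line per year: months on the index, years as columns.
monthly = dataset1.groupby(["Year", "Month"])["Txns"].sum().unstack("Year")
monthly.plot(marker="o", ax=ax_month)
ax_month.set_ylabel("Txns")
```

The markers matter here because 2014 has a single month in the sample, and a one-point line would otherwise be invisible.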