I would like to plot multiple lines in Python for this dataset (x = year, y = freq):
Student_ID Year Freq
A 2012 6
B 2008 22
C 2009 18
A 2010 7
B 2012 13
D 2012 31
D 2013 1
where each Student_ID has data for different years, and Freq is the result of a groupby count.
I would like to have one line for each student_id.
I have tried with this:
df.groupby(['Year'])['Freq'].count().plot()
but it does not plot one line for each student_id.
Any suggestions are more than welcome. Thank you for your help.
I wasn't sure from your question whether you wanted count (which in your example is all ones) or sum, so I solved this with sum; if you'd like count, just swap it in for sum in the first line.
df_ = df.groupby(['Student_ID', 'Year'])['Freq'].sum()
print(df_)
Student_ID  Year
A           2010     7
            2012     6
B           2008    22
            2012    13
C           2009    18
D           2012    31
            2013     1
Name: Freq, dtype: int64
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for student in df_.index.get_level_values('Student_ID').unique():
    df_[student].plot(ax=ax, label=student)
plt.legend()
plt.show()
Which gives you a plot with one line per Student_ID.
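An alternative that avoids the explicit loop (a sketch, assuming a `df` with the columns `Student_ID`, `Year`, `Freq` from the question): unstack `Student_ID` into columns, and `.plot()` then draws one line per column.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Student_ID': ['A', 'B', 'C', 'A', 'B', 'D', 'D'],
    'Year': [2012, 2008, 2009, 2010, 2012, 2012, 2013],
    'Freq': [6, 22, 18, 7, 13, 31, 1],
})

# Pivot Student_ID into columns; each column then plots as its own line
wide = df.groupby(['Student_ID', 'Year'])['Freq'].sum().unstack('Student_ID')
ax = wide.plot(marker='o')
ax.set_ylabel('Freq')
plt.show()
```

Years missing for a given student simply come out as NaN and leave a gap in that student's line.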
I have this pandas data frame, and I want to make a line plot with one line per year:
year month canasta
0 2011 1 239.816531
1 2011 2 239.092353
2 2011 3 239.332308
3 2011 4 237.591538
4 2011 5 238.384231
... ... ... ...
59 2015 12 295.578605
60 2016 1 296.918861
61 2016 2 296.398701
62 2016 3 296.488780
63 2016 4 300.922927
And I tried this code:
dca.groupby(['year', 'month'])['canasta'].mean().reset_index().plot()
But I get the wrong result: every column is drawn as its own line on a single axis.
I must be doing something wrong. Please, could you help me with this plot? The x axis should be the months, and there should be one line per year.
Why: after you do reset_index, year and month become ordinary columns, and some_df.plot() simply plots all the columns of the dataframe on one axis, resulting in what you posted.
Fix: Try unstack instead of reset_index:
(dca.groupby(['year', 'month'])
    ['canasta'].mean()
    .unstack('year').plot()
)
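To see why this gives one line per year, here is a minimal sketch (the `canasta` values are made up): after `unstack('year')` the months form the index and each year becomes a column, so `.plot()` draws one line per year.

```python
import pandas as pd

# A tiny frame with the same structure as the question's data
dca = pd.DataFrame({
    'year':    [2011, 2011, 2012, 2012],
    'month':   [1, 2, 1, 2],
    'canasta': [239.8, 239.1, 250.0, 251.5],
})

# Months as the index, one column per year
wide = dca.groupby(['year', 'month'])['canasta'].mean().unstack('year')
print(wide)
```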
Hi everyone, I want to calculate the sum of the Violent_type count per year. For example, the total count of Violent_type for 2013 is 18728 + 121662 + 1035. But I don't know how to select the data when there is a MultiIndex. Any advice will be appreciated. Thanks.
The level argument in pandas.DataFrame.groupby() is what you are looking for.
level int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
To answer your question, you only need:
df.groupby(level=[0, 1]).sum()
# or
df.groupby(level=['district', 'year']).sum()
To see the effect:
import pandas as pd
iterables = [['001', 'SST'], [2013, 2014], ['Dangerous', 'Non-Violent', 'Violent']]
index = pd.MultiIndex.from_product(iterables, names=['district', 'year', 'Violent_type'])
df = pd.DataFrame(list(range(0, len(index))), index=index, columns=['count'])
'''
print(df)
                              count
district year Violent_type
001      2013 Dangerous           0
              Non-Violent         1
              Violent             2
         2014 Dangerous           3
              Non-Violent         4
              Violent             5
SST      2013 Dangerous           6
              Non-Violent         7
              Violent             8
         2014 Dangerous           9
              Non-Violent        10
              Violent            11
'''
print(df.groupby(level=[0, 1]).sum())
'''
               count
district year
001      2013      3
         2014     12
SST      2013     21
         2014     30
'''
print(df.groupby(level=['district', 'year']).sum())
'''
               count
district year
001      2013      3
         2014     12
SST      2013     21
         2014     30
'''
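Since the question asked for the total per year only (e.g. 18728 + 121662 + 1035 for 2013), note that you can also group on the `year` level alone, which sums across districts as well; a sketch continuing the toy frame above:

```python
import pandas as pd

iterables = [['001', 'SST'], [2013, 2014], ['Dangerous', 'Non-Violent', 'Violent']]
index = pd.MultiIndex.from_product(iterables, names=['district', 'year', 'Violent_type'])
df = pd.DataFrame(list(range(len(index))), index=index, columns=['count'])

# Group on the 'year' level alone: sums across districts and violent types
per_year = df.groupby(level='year').sum()
print(per_year)
#       count
# year
# 2013     24
# 2014     42
```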
I want to create a graph that will display the cumulative average revenue for each 'Year Onboarded' (first customer transaction) over a period of time. But I am making mistakes when grouping the information I need.
Toy Data:
dataset = {'ClientId': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 4, 4, 4, 4, 4, 4],
           'Year Onboarded': [2018, 2019, 2020, 2018, 2019, 2020, 2018, 2019, 2020,
                              2018, 2019, 2020, 2016, 2016, 2016, 2016, 2016, 2016, 2016],
           'Year': [2019, 2019, 2020, 2019, 2019, 2020, 2018, 2020, 2020,
                    2020, 2019, 2020, 2016, 2017, 2018, 2019, 2020, 2017, 2018],
           'Revenue': [100, 50, 25, 30, 40, 50, 60, 100, 20, 40, 100, 20, 5, 5, 8, 4, 10, 20, 8]}
df = pd.DataFrame(data=dataset)
Explanation: Customers have a designated 'Year Onboarded' and they make a transaction every 'Year' mentioned.
Then I calculate the years that have elapsed since the clients onboarded in order to make my graph visually more appealing.
df['Yearsdiff'] = df['Year']-df['Year Onboarded']
To calculate the Cumulative Average Revenue I tried the following methods:
First try:
df = df.join(df.groupby(['Year']).expanding().agg({'Revenue': 'mean'})
               .reset_index(level=0, drop=True)
               .add_suffix('_roll'))
df.groupby(['Year Onboarded', 'Year']).last().drop(columns=['Revenue'])
The output starts to be cumulative but the last row isn't cumulative anymore (not sure why).
Second Try:
df.groupby(['Year Onboarded', 'Year']).agg('mean') \
  .groupby(level=[1]) \
  .agg({'Revenue': np.cumsum})
But it doesn't work properly; I tried other approaches as well but didn't get good results.
To visualize the cumulative average revenue I simply use sns.lineplot.
My goal is to get a graph similar as the one below but for that I first need to group my data correctly.
Expected output plot
The Years that we can see on the graph represent the 'Year Onboarded' not the 'Year'.
Can someone help me calculate a Cumulative Average Revenue that works in order to plot a graph similar to the one above? Thank you
Also, the data in the toy dataset will surely not reproduce the example plot, but the idea should be there.
This is how I would do it; since the toy data is not the same as your real data, some changes may be needed, but all in all:
import seaborn as sns

df1 = df.copy()
df1['Yearsdiff'] = df1['Year'] - df1['Year Onboarded']

# Find the average revenue per Year Onboarded
df1['Revenue'] = df1.groupby(['Year Onboarded'])['Revenue'].transform('mean')

# Calculate the cumulative sum of Revenue (which is now the average per
# Year Onboarded) per Yearsdiff (because this will be our x-axis in the plot)
df1['Revenue'] = df1.groupby(['Yearsdiff'])['Revenue'].transform('cumsum')

# Finally plot the data, using the column 'Year' as hue to account for the
# different years
sns.lineplot(x=df1['Yearsdiff'], y=df1['Revenue'], hue=df1['Year'])
You can create a rolling mean like this:
df['rolling_mean'] = df.groupby(['Year Onboarded'])['Revenue'].apply(lambda x: x.rolling(10, 1).mean())
df
# ClientId Year Onboarded Year Revenue rolling_mean
# 0 1 2018 2019 100 100.000000
# 1 2 2019 2019 50 50.000000
# 2 3 2020 2020 25 25.000000
# 3 1 2018 2019 30 65.000000
# 4 2 2019 2019 40 45.000000
# 5 3 2020 2020 50 37.500000
# 6 1 2018 2018 60 63.333333
# 7 2 2019 2020 100 63.333333
# 8 3 2020 2020 20 31.666667
# 9 1 2018 2020 40 57.500000
# 10 2 2019 2019 100 72.500000
# 11 3 2020 2020 20 28.750000
# 12 4 2016 2016 5 5.000000
# 13 4 2016 2017 5 5.000000
# 14 4 2016 2018 8 6.000000
# 15 4 2016 2019 4 5.500000
# 16 4 2016 2020 10 6.400000
# 17 4 2016 2017 20 8.666667
# 18 4 2016 2018 8 8.571429
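An equivalent way to get the same running average without having to pick a rolling window large enough to cover every group is `expanding().mean()` inside a `transform`; a sketch using the toy data (it reproduces the numbers in the table above):

```python
import pandas as pd

dataset = {'ClientId': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 4, 4, 4, 4, 4, 4],
           'Year Onboarded': [2018, 2019, 2020, 2018, 2019, 2020, 2018, 2019, 2020,
                              2018, 2019, 2020, 2016, 2016, 2016, 2016, 2016, 2016, 2016],
           'Year': [2019, 2019, 2020, 2019, 2019, 2020, 2018, 2020, 2020,
                    2020, 2019, 2020, 2016, 2017, 2018, 2019, 2020, 2017, 2018],
           'Revenue': [100, 50, 25, 30, 40, 50, 60, 100, 20, 40, 100, 20, 5, 5, 8, 4, 10, 20, 8]}
df = pd.DataFrame(dataset)

# Running (cumulative) mean of Revenue within each onboarding cohort, in row order
df['cum_avg'] = (df.groupby('Year Onboarded')['Revenue']
                   .transform(lambda s: s.expanding().mean()))
```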
This question already has answers here:
pandas groupby where you get the max of one column and the min of another column
(2 answers)
Closed 4 years ago.
I have a big dataframe with a structure like this:
ID Year Consumption
1 2012 24
2 2012 20
3 2012 21
1 2013 22
2 2013 23
3 2013 24
4 2013 25
I want another DataFrame that contains the first year of appearance and the max consumption of all time per ID, like this:
ID First_Year Max_Consumption
1 2012 24
2 2012 23
3 2012 24
4 2013 25
Is there a way to extract this data without using loops? I have tried this:
year = list(set(df.Year))
ids = list(set(df.ID))
antiq = list()
max_con = list()
for i in ids:
    df_id = df[df['ID'] == i]
    antiq.append(min(df_id['Year']))
    max_con.append(max(df_id['Consumption']))
But it's too slow. Thank you!
Use GroupBy + agg:
res = df.groupby('ID', as_index=False).agg({'Year': 'min', 'Consumption': 'max'})
print(res)
ID Year Consumption
0 1 2012 24
1 2 2012 23
2 3 2012 24
3 4 2013 25
Another alternative to groupby is pivot_table:
pd.pivot_table(df, index="ID", aggfunc={"Year":min, "Consumption":max})
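If you also want the column names from the expected output (`First_Year`, `Max_Consumption`), named aggregation (available since pandas 0.25) produces them directly; a sketch with the sample data:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 1, 2, 3, 4],
                   'Year': [2012, 2012, 2012, 2013, 2013, 2013, 2013],
                   'Consumption': [24, 20, 21, 22, 23, 24, 25]})

# Named aggregation: output_column=(input_column, aggfunc)
res = df.groupby('ID', as_index=False).agg(
    First_Year=('Year', 'min'),
    Max_Consumption=('Consumption', 'max'),
)
print(res)
#    ID  First_Year  Max_Consumption
# 0   1        2012               24
# 1   2        2012               23
# 2   3        2012               24
# 3   4        2013               25
```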
I'm trying to create new dataframe columns from the contents of an existing column. It's easier to explain with an example. I would like to convert this:
. Yr Month Class Cost
1 2015 1 L 19.2361
2 2015 1 M 29.4723
3 2015 1 S 48.5980
4 2015 1 T 169.7630
5 2015 2 L 19.1506
6 2015 2 M 30.0886
7 2015 2 S 49.3765
8 2015 2 T 167.0000
9 2015 3 L 19.3465
10 2015 3 M 29.1991
11 2015 3 S 46.2580
12 2015 3 T 157.7916
13 2015 4 L 18.3165
14 2015 4 M 28.2314
15 2015 4 S 44.5844
16 2015 4 T 162.3241
17 2015 5 L 17.4556
18 2015 5 M 27.0434
19 2015 5 S 42.8841
20 2015 5 T 159.3457
21 2015 6 L 16.5343
22 2015 6 M 24.9853
23 2015 6 S 40.5612
24 2015 6 T 153.4902
...into the following so that I can plot 4 separate lines [L, M, S, T]:
. Yr Month L M S T
1 2015 1 19.2361 29.4723 48.5980 169.7630
2 2015 2 19.1506 30.0886 49.3765 167.0000
3 2015 3 19.3465 29.1991 46.2580 157.7916
4 2015 4 18.3165 28.2314 44.5844 162.3241
5 2015 5 17.4556 27.0434 42.8841 159.3457
6 2015 6 16.5343 24.9853 40.5612 153.4902
I was able to work through it in what feels like a very clumsy way, by filtering the dataframe on the 'Class' column and then doing 3 separate merges.
list_class = ['L', 'M', 'S', 'T']
year = 'Yr'
month = 'Month'
df_class = pd.DataFrame()
df_class1 = pd.DataFrame()
df_class2 = pd.DataFrame()
df_class1 = pd.merge(df[[month, year, 'Class', 'Cost']][df['Class'] == list_class[0]],
                     df[[month, year, 'Class', 'Cost']][df['Class'] == list_class[1]],
                     left_on=[month, year], right_on=[month, year])
df_class2 = pd.merge(df[[month, year, 'Class', 'Cost']][df['Class'] == list_class[2]],
                     df[[month, year, 'Class', 'Cost']][df['Class'] == list_class[3]],
                     left_on=[month, year], right_on=[month, year])
df_class = pd.merge(df_class1, df_class2, left_on=[month, year],
                    right_on=[month, year]).groupby([year, month]).mean().plot(figsize=(15, 8))
There must be a more efficient way. Feels like it should be done with groupby, but I couldn't nail it down.
You can first convert the df to a MultiIndex and then unstack the 'Class' level, which gives you what you want. Suppose df is the original dataframe shown at the very beginning of your post:
df.set_index(['Yr', 'Month', 'Class'])['Cost'].unstack('Class')
Out[29]:
Class         L        M        S         T
Yr   Month
2015 1      19.2361  29.4723  48.5980  169.7630
     2      19.1506  30.0886  49.3765  167.0000
     3      19.3465  29.1991  46.2580  157.7916
     4      18.3165  28.2314  44.5844  162.3241
     5      17.4556  27.0434  42.8841  159.3457
     6      16.5343  24.9853  40.5612  153.4902
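The same wide shape can also be produced in one call with `pivot_table`; a sketch using the first two months of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Yr': [2015] * 8,
    'Month': [1, 1, 1, 1, 2, 2, 2, 2],
    'Class': ['L', 'M', 'S', 'T'] * 2,
    'Cost': [19.2361, 29.4723, 48.5980, 169.7630,
             19.1506, 30.0886, 49.3765, 167.0000],
})

# Rows keyed by (Yr, Month), one column per Class; mean is the default aggfunc
wide = df.pivot_table(index=['Yr', 'Month'], columns='Class', values='Cost')
ax = wide.plot(figsize=(15, 8))  # one line per class
```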