I have the following df:
df =
year intensity category
2015 22 1
2015 21 1
2015 23 2
2016 25 2
2017 20 1
2017 21 1
2017 20 3
I need to group by year and calculate an average intensity and a most frequent category(per year).
I know that it's possible to calculate most frequent category as follows:
df.groupby('year')['category'].agg(lambda x: x.value_counts().index[0])
I also know how to calculate average intensity:
df = df.groupby(["year"]).agg({'intensity':'mean'}).reset_index()
But I don't know how to put everything together without join operation.
Use agg with a dictionary to define how to aggregate each column.
df.groupby('year', as_index=False)[['category', 'intensity']]\
.agg({'category': lambda x: pd.Series.mode(x)[0], 'intensity':'mean'})
Output:
year category intensity
0 2015 1 22.000000
1 2016 2 25.000000
2 2017 1 20.333333
Or you can still use lambda funcion
df.groupby('year', as_index=False)[['category','intensity']]\
.agg({'category': lambda x: x.value_counts().index[0],'intensity':'mean'})
Related
I'm trying to calculate a weighted average for multiple columns in a dataframe.
This is a sample of my data
Group
Year
Month
Weight(kg)
Nitrogen
Calcium
A
2020
01
10000
10
70
A
2020
01
15000
4
78
A
2021
05
12000
5
66
A
2021
05
10000
8
54
B
2021
08
14000
10
90
C
2021
08
50000
20
92
C
2021
08
40000
10
95
My desired result would look something like this:
What I've tried:
I can get the correct weighted average values for a single column using this function:
(similar to: link)
def wavg(df, value, weight):
d = df[value]
w = df[weight]
try:
return (d * w).sum() / w.sum()
except ZeroDivisionError:
return d.mean()
I can apply this function to a single column of my df:
df2 = df.groupby(["Group", "year", "month"]).apply(wavg, "Calcium", "Weight(kg").to_frame()
(Don't mind the different values, they are correct for the data in my notebook)
The obvious problem is that this function only works for a single column whilst I have a douzens of columns. I therefore tried a for loop:
column_list=[]
for column in df.columns:
column_list.append(df.groupby(["Group", "year", "month"]).apply(wavg, column, "Weight(kg").to_frame())
It calculates the values correctly, but the columns are placed on top of each other instead of next to eachother. They also miss a usefull column name:
How could I adapt my code to return the desired df?
Change function for working by multiple columns and for avoid removing column for grouping are converting to MultiIndex:
def wavg(x, value, weight):
d = x[value]
w = x[weight]
try:
return (d.mul(w, axis=0)).div(w.sum())
except ZeroDivisionError:
return d.mean()
#columns used for groupby
groups = ["Group", "Year", "Month"]
#processing all another columns
cols = df.columns.difference(groups + ["Weight(kg)"], sort=False)
#create index and processing all columns by variable cols
df1 = (df.set_index(groups)
.groupby(level=groups)
.apply(wavg, cols, "Weight(kg)")
.reset_index())
print (df2)
Group Year Month Calcium Nitrogen
0 A 2020 1 28.000000 4.000000
1 A 2020 1 46.800000 2.400000
2 A 2021 5 36.000000 2.727273
3 A 2021 5 24.545455 3.636364
4 B 2021 8 90.000000 10.000000
5 C 2021 8 51.111111 11.111111
6 C 2021 8 42.222222 4.444444
Try via concat() and reset_index():
df=pd.concat(column_list,axis=1).reset_index()
OR
you can make changes here:
column_list=[]
for column in df.columns:
column_list.append(df.groupby(["Group", "year", "month"]).apply(wavg, column, "Weight(kg").reset_index())
#Finally:
df=pd.concat(column_list,axis=1)
I have a year wise dataframe with each year has three parameters year,type and value. I'm trying to calculate percentage of taken vs empty. For example year 2014 has total of 50 empty and 50 taken - So 50% in empty and 50% in taken as shown in final_df
df
year type value
0 2014 Total 100
1 2014 Empty 50
2 2014 Taken 50
3 2013 Total 2000
4 2013 Empty 100
5 2013 Taken 1900
6 2012 Total 50
7 2012 Empty 45
8 2012 Taken 5
Final df
year Empty Taken
0 2014 50 50
0 2013 ... ...
0 2012 ... ...
Should i shift cells up and do the percentage calculate or any other method?
You can use pivot_table:
new = df[df['type'] != 'Total']
res = (new.pivot_table(index='year',columns='type',values='value').sort_values(by='year',ascending=False).reset_index())
which gets you:
res
year Empty Taken
0 2014 50 50
1 2013 100 1900
2 2012 45 5
And then you can get the percentages for each column:
total = (res['Empty'] + res['Taken'])
for col in ['Empty','Taken']:
res[col+'_perc'] = res[col] / total
year Empty Taken Empty_perc Taken_perc
2014 50 50 0.50 0.50
2013 100 1900 0.05 0.95
2012 45 5 0.90 0.10
As #sophods pointed out, you can use pivot_table to rearange your dataframe, however, to add to his answer; i think you're after the percentage, hence i suggest you keep the 'Total' record and then apply your calculation:
#pivot your data
res = (df.pivot_table(index='year',columns='type',values='value')).reset_index()
#calculate percentages of empty and taken
res['Empty'] = res['Empty']/res['Total']
res['Taken'] = res['Taken']/res['Total']
#final dataframe
res = res[['year', 'Empty', 'Taken']]
You can filter out records having Empty and Taken in type and then groupby year and apply func. In func, you can set the type as index and then get the required values and calculate the percentage. x in func would be dataframe having type and value columns and data per group.
def func(x):
x = x.set_index('type')
total = x['value'].sum()
return [(x.loc['Empty', 'value']/total)*100, (x.loc['Taken', 'value']/total)*100]
temp = (df[df['type'].isin({'Empty', 'Taken'})]
.groupby('year')[['type', 'value']]
.apply(lambda x: func(x)))
temp
year
2012 [90.0, 10.0]
2013 [5.0, 95.0]
2014 [50.0, 50.0]
dtype: object
Convert the result into the required dataframe
pd.DataFrame(temp.values.tolist(), index=temp.index, columns=['Empty', 'Taken'])
Empty Taken
year
2012 90.0 10.0
2013 5.0 95.0
2014 50.0 50.0
I want to create a graph that will display the cumulative average revenue for each 'Year Onboarded' (first customer transaction) over a period of time. But I am making mistakes when grouping the information I need.
Toy Data:
dataset = {'ClientId': [1,2,3,1,2,3,1,2,3,1,2,3,4,4,4,4,4,4,4],
'Year Onboarded': [2018,2019,2020,2018,2019,2020,2018,2019,2020,2018,2019,2020,2016,2016,2016,2016,2016,2016,2016],
'Year': [2019,2019,2020,2019,2019,2020,2018,2020,2020,2020,2019,2020,2016,2017,2018,2019,2020,2017,2018],
'Revenue': [100,50,25,30,40,50,60,100,20,40,100,20,5,5,8,4,10,20,8]}
df = pd.DataFrame(data=dataset)
Explanation: Customers have a designated 'Year Onboarded' and they make a transaction every 'Year' mentioned.
Then I calculate the years that have elapsed since the clients onboarded in order to make my graph visually more appealing.
df['Yearsdiff'] = df['Year']-df['Year Onboarded']
To calculate the Cumulative Average Revenue I tried the following methods:
First try:
df = df.join(df.groupby(['Year']).expanding().agg({ 'Revenue': 'mean'})
.reset_index(level=0, drop=True)
.add_suffix('_roll'))
df.groupby(['Year Onboarded', 'Year']).last().drop(columns=['Revenue'])
The output starts to be cumulative but the last row isn't cumulative anymore (not sure why).
Second Try:
df.groupby(['Year Onboarded','Year']).agg('mean') \
.groupby(level=[1]) \
.agg({'Revenue':np.cumsum})
But it doesn't work properly, I tried other ways as well but didn't achieve good results.
To visualize the cumulative average revenue I simply use sns.lineplot
My goal is to get a graph similar as the one below but for that I first need to group my data correctly.
Expected output plot
The Years that we can see on the graph represent the 'Year Onboarded' not the 'Year'.
Can someone help me calculate a Cumulative Average Revenue that works in order to plot a graph similar to the one above? Thank you
Also the data provided in the toy dataset will surely not give something similar to the example plot but the idea should be there.
This is how I would do it and considering the toy data is not the same, probably some changes should be done, but all in all:
import seaborn as sns
df1 = df.copy()
df1['Yearsdiff'] = df1['Year']-df1['Year Onboarded']
df1['Revenue'] = df.groupby(['Year Onboarded'])['Revenue'].transform('mean')
#Find the average revenue per Year Onboarded
df1['Revenue'] = df1.groupby(['Yearsdiff'])['Revenue'].transform('cumsum')
#Calculate the cumulative sum of Revenue (Which is now the average per Year Onboarded) per Yearsdiff (because this will be our X-axis in the plot)
sns.lineplot(x=df1['Yearsdiff'],y=df1['Revenue'],hue=df1['Year'])
#Finally plot the data, using the column 'Year' as hue to account for the different years.
You can create rolling mean like this:
df['rolling_mean'] = df.groupby(['Year Onboarded'])['Revenue'].apply(lambda x: x.rolling(10, 1).mean())
df
# ClientId Year Onboarded Year Revenue rolling_mean
# 0 1 2018 2019 100 100.000000
# 1 2 2019 2019 50 50.000000
# 2 3 2020 2020 25 25.000000
# 3 1 2018 2019 30 65.000000
# 4 2 2019 2019 40 45.000000
# 5 3 2020 2020 50 37.500000
# 6 1 2018 2018 60 63.333333
# 7 2 2019 2020 100 63.333333
# 8 3 2020 2020 20 31.666667
# 9 1 2018 2020 40 57.500000
# 10 2 2019 2019 100 72.500000
# 11 3 2020 2020 20 28.750000
# 12 4 2016 2016 5 5.000000
# 13 4 2016 2017 5 5.000000
# 14 4 2016 2018 8 6.000000
# 15 4 2016 2019 4 5.500000
# 16 4 2016 2020 10 6.400000
# 17 4 2016 2017 20 8.666667
# 18 4 2016 2018 8 8.571429
I have a Data Frame with columns: Year and Min Delay. Sample rows as follows:
2014 0
2014 2
2014 0
2014 4
2015 4
2015 4
2015 2
2015 2
I want to group this dataframe by year and find the delay ratio per year (i.e. number of non-zero entries that year divided by total number of entries for that year). So if we consider the data frame above, what I am trying to get is:
2014 0.5
2015 1
(There are 2 delays in 2014, total 4, 4 delays in 2015 total 4. A delay is defined by Min Delay > 0)
This is what I tried:
def find_ratio(df):
ratio = 1 - (len(df[df == 0]) / len(df))
return ratio
print(df.groupby(["Year"])["Min Delay"].transform(find_ratio).unique())
which prints: [0.5 1]
How can I get a data frame instead of an array?
First I think unique is not good idea use here. Because if need assign output of function to years, it is impossible.
Also transform is good idea if need new column to DataFrame, not aggregated DataFrame.
I think need GroupBy.apply, also function should be simplify by mean of boolean mask:
def find_ratio(df):
ratio = (df != 0).mean()
return ratio
print(df.groupby(["Year"])["Min Delay"].apply(find_ratio).reset_index(name='ratio'))
Year ratio
0 2014 0.5
1 2015 1.0
Solution with lambda function:
print (df.groupby(["Year"])["Min Delay"]
.apply(lambda x: (x != 0).mean())
.reset_index(name='ratio'))
Year ratio
0 2014 0.5
1 2015 1.0
Solution with GroupBy.transform return new column:
df['ratio'] = df.groupby(["Year"])["Min Delay"].transform(find_ratio)
print (df)
Year Min Delay ratio
0 2014 0 0.5
1 2014 2 0.5
2 2014 0 0.5
3 2014 4 0.5
4 2015 4 0.0
5 2015 4 0.0
6 2015 2 0.0
7 2015 2 0.0
id marks year
1 18 2013
1 25 2012
3 16 2014
2 16 2013
1 19 2013
3 25 2013
2 18 2014
suppose now I group the above on id by python command.
grouped = file.groupby(file.id)
I would like to get a new file with only the row in each group with recent year that is highest of all the year in the group.
Please let me know the command, I am trying with apply but it ll only given the boolean expression. I want the entire row with latest year.
I cobbled this together using this: Python : Getting the Row which has the max value in groups using groupby
So basically we can groupby the 'id' column, then call transform on the 'year' column and create a boolean index where the year matches the max year value for each 'id':
In [103]:
df[df.groupby(['id'])['year'].transform(max) == df['year']]
Out[103]:
id marks year
0 1 18 2013
2 3 16 2014
4 1 19 2013
6 2 18 2014