weighted average aggregation on multiple columns of df

weighted average aggregation on multiple columns of df - python

I'm trying to calculate a weighted average for multiple columns in a dataframe.
This is a sample of my data
Group
Year
Month
Weight(kg)
Nitrogen
Calcium
A
2020
01
10000
10
70
A
2020
01
15000
4
78
A
2021
05
12000
5
66
A
2021
05
10000
8
54
B
2021
08
14000
10
90
C
2021
08
50000
20
92
C
2021
08
40000
10
95
My desired result would look something like this:
What I've tried:
I can get the correct weighted average values for a single column using this function:
(similar to: link)
def wavg(df, value, weight):
d = df[value]
w = df[weight]
try:
return (d * w).sum() / w.sum()
except ZeroDivisionError:
return d.mean()
I can apply this function to a single column of my df:
df2 = df.groupby(["Group", "year", "month"]).apply(wavg, "Calcium", "Weight(kg").to_frame()
(Don't mind the different values, they are correct for the data in my notebook)
The obvious problem is that this function only works for a single column whilst I have a douzens of columns. I therefore tried a for loop:
column_list=[]
for column in df.columns:
column_list.append(df.groupby(["Group", "year", "month"]).apply(wavg, column, "Weight(kg").to_frame())
It calculates the values correctly, but the columns are placed on top of each other instead of next to eachother. They also miss a usefull column name:
How could I adapt my code to return the desired df?

Change function for working by multiple columns and for avoid removing column for grouping are converting to MultiIndex:
def wavg(x, value, weight):
d = x[value]
w = x[weight]
try:
return (d.mul(w, axis=0)).div(w.sum())
except ZeroDivisionError:
return d.mean()
#columns used for groupby
groups = ["Group", "Year", "Month"]
#processing all another columns
cols = df.columns.difference(groups + ["Weight(kg)"], sort=False)
#create index and processing all columns by variable cols
df1 = (df.set_index(groups)
.groupby(level=groups)
.apply(wavg, cols, "Weight(kg)")
.reset_index())
print (df2)
Group Year Month Calcium Nitrogen
0 A 2020 1 28.000000 4.000000
1 A 2020 1 46.800000 2.400000
2 A 2021 5 36.000000 2.727273
3 A 2021 5 24.545455 3.636364
4 B 2021 8 90.000000 10.000000
5 C 2021 8 51.111111 11.111111
6 C 2021 8 42.222222 4.444444

Try via concat() and reset_index():
df=pd.concat(column_list,axis=1).reset_index()
OR
you can make changes here:
column_list=[]
for column in df.columns:
column_list.append(df.groupby(["Group", "year", "month"]).apply(wavg, column, "Weight(kg").reset_index())
#Finally:
df=pd.concat(column_list,axis=1)

Related

Pandas Groupby and remove rows with numbers that are close to each other

I have a data frame df
df =
Code Bikes Year
12 356 2020
4 378 2020
2 389 2020
35 378 2021
40 370 2021
32 350 2021
I would like to group the data frame based on Year using df.groupby('Year') and check the close values in column df['Çode'] to find values that are close by at least 3 and retain the row with a maximum value in the df['Bikes'] column out of them.
For instance, in the first group of 2020, values 4 and 2 are close by at least 3 since 4-2=2 ≤ 3 and since 389 (df['Bikes']) corresponding to df['Code'] = 2 is the highest among the two, retain that and drop row where df['code']=4.
The expected out for the given example:
Code Bikes Year
12 356 2020
2 389 2020
35 378 2021
40 370 2021

new_df = df.sort_values(['Year', 'Code'])
new_df['diff'] = new_df['Code'] - new_df.groupby('Year')['Code'].shift()
new_df['cumsum'] = ((new_df['diff'] > 3) | (new_df['diff'].isna())).cumsum()
new_df = new_df.sort_values('Bikes', ascending=False).drop_duplicates(['cumsum']).sort_index()
new_df.drop(columns=['diff', 'cumsum'], inplace=True)

You can first sort values per both columns by DataFrame.sort_values, then create groups by compare differences with treshold 3 and cumulative sum, last use DataFrameGroupBy.idxmax for get maximal indices by Bikes per Year and helper Series:
df1 = df.sort_values(['Year','Code'])
g = df1.groupby('Year')['Code'].diff().gt(3).cumsum()
df2 = df.loc[df1.groupby(['Year', g])['Bikes'].idxmax()].sort_index()
print (df2)
Code Bikes Year
0 12 356 2020
2 2 389 2020
3 35 378 2021
4 40 370 2021

Pandas DataFrame: Calculate percentage difference between rows?

I have a year wise dataframe with each year has three parameters year,type and value. I'm trying to calculate percentage of taken vs empty. For example year 2014 has total of 50 empty and 50 taken - So 50% in empty and 50% in taken as shown in final_df
df
year type value
0 2014 Total 100
1 2014 Empty 50
2 2014 Taken 50
3 2013 Total 2000
4 2013 Empty 100
5 2013 Taken 1900
6 2012 Total 50
7 2012 Empty 45
8 2012 Taken 5
Final df
year Empty Taken
0 2014 50 50
0 2013 ... ...
0 2012 ... ...
Should i shift cells up and do the percentage calculate or any other method?

You can use pivot_table:
new = df[df['type'] != 'Total']
res = (new.pivot_table(index='year',columns='type',values='value').sort_values(by='year',ascending=False).reset_index())
which gets you:
res
year Empty Taken
0 2014 50 50
1 2013 100 1900
2 2012 45 5
And then you can get the percentages for each column:
total = (res['Empty'] + res['Taken'])
for col in ['Empty','Taken']:
res[col+'_perc'] = res[col] / total
year Empty Taken Empty_perc Taken_perc
2014 50 50 0.50 0.50
2013 100 1900 0.05 0.95
2012 45 5 0.90 0.10

As #sophods pointed out, you can use pivot_table to rearange your dataframe, however, to add to his answer; i think you're after the percentage, hence i suggest you keep the 'Total' record and then apply your calculation:
#pivot your data
res = (df.pivot_table(index='year',columns='type',values='value')).reset_index()
#calculate percentages of empty and taken
res['Empty'] = res['Empty']/res['Total']
res['Taken'] = res['Taken']/res['Total']
#final dataframe
res = res[['year', 'Empty', 'Taken']]

You can filter out records having Empty and Taken in type and then groupby year and apply func. In func, you can set the type as index and then get the required values and calculate the percentage. x in func would be dataframe having type and value columns and data per group.
def func(x):
x = x.set_index('type')
total = x['value'].sum()
return [(x.loc['Empty', 'value']/total)*100, (x.loc['Taken', 'value']/total)*100]
temp = (df[df['type'].isin({'Empty', 'Taken'})]
.groupby('year')[['type', 'value']]
.apply(lambda x: func(x)))
temp
year
2012 [90.0, 10.0]
2013 [5.0, 95.0]
2014 [50.0, 50.0]
dtype: object
Convert the result into the required dataframe
pd.DataFrame(temp.values.tolist(), index=temp.index, columns=['Empty', 'Taken'])
Empty Taken
year
2012 90.0 10.0
2013 5.0 95.0
2014 50.0 50.0

groupby().mean() don't work under for loop

I have a dictionary named c with objects as dataframe, each dataframe has 3 columns: 'year' 'month' & 'Tmed' , I want to calculate the monthly mean values of Tmed for each year, I used
for i in range(22) : c[i].groupby(['year','month']).mean().reset_index()
This returns
year month Tmed
0 2018 12 14.8
2 2018 12 12.0
3 2018 11 16.1
5 2018 11 9.8
6 2018 11 9.8
9 2018 11 9.3
4425 rows × 3 columns
The index is not as it should be, and for the 11th month of 2018 for example, there should be only one row but as you see the dataframe has more than one.
I tried the code on a single dataframe and it gave the wanted result :
c[3].groupby(['year','month']).mean().reset_index()
year month Tmed
0 1999 9 23.950000
1 1999 10 19.800000
2 1999 11 12.676000
3 1999 12 11.012000
4 2000 1 9.114286
5 2000 2 12.442308
6 2000 3 13.403704
7 2000 4 13.803846
8 2000 5 17.820000
.
.
.
218 2018 6 21.093103
219 2018 7 24.977419
220 2018 8 26.393103
221 2018 9 24.263333
222 2018 10 19.069565
223 2018 11 13.444444
224 2018 12 13.400000
225 rows × 3 columns
I need to put for loop because I have many dataframes, I can't figure out the issue, any help would be gratefull.

I don't see a reason why your code should fail. I tried below and got the required results:
import numpy as np
import pandas as pd
def getRandomDataframe():
rand_year = pd.DataFrame(np.random.randint(2010, 2011,size=(50, 1)), columns=list('y'))
rand_month = pd.DataFrame(np.random.randint(1, 13,size=(50, 1)), columns=list('m'))
rand_value = pd.DataFrame(np.random.randint(0, 100,size=(50, 1)), columns=list('v'))
df = pd.DataFrame(columns=['year', 'month', 'value'])
df['year'] = rand_year
df['month'] = rand_month
df['value'] = rand_value
return df
def createDataFrameDictionary():
_dict = {}
length = 3
for i in range(length):
_dict[i] = getRandomDataframe()
return _dict
c = createDataFrameDictionary()
for i in range(3):
c[i] = c[i].groupby(['year','month'])['value'].mean().reset_index()
# Check results
print(c[0])

Please check if the year, month combo repeats in different dataframes which could be the reason for the repeat.
In your scenario, it may be a good idea to collect the groupby.mean results for each dataframe in another dataframe and do a groupby mean again on the new dataframe

Can you try the following:
main_df = pd.DataFrame()
for i in range(22):
main_df = pd.concat([main_df, c[i].groupby(['year','month']).mean().reset_index()])
print(main_df.groupby(['year','month']).mean())

How to calculate average and most frequent values per group?

I have the following df:
df =
year intensity category
2015 22 1
2015 21 1
2015 23 2
2016 25 2
2017 20 1
2017 21 1
2017 20 3
I need to group by year and calculate an average intensity and a most frequent category(per year).
I know that it's possible to calculate most frequent category as follows:
df.groupby('year')['category'].agg(lambda x: x.value_counts().index[0])
I also know how to calculate average intensity:
df = df.groupby(["year"]).agg({'intensity':'mean'}).reset_index()
But I don't know how to put everything together without join operation.

Use agg with a dictionary to define how to aggregate each column.
df.groupby('year', as_index=False)[['category', 'intensity']]\
.agg({'category': lambda x: pd.Series.mode(x)[0], 'intensity':'mean'})
Output:
year category intensity
0 2015 1 22.000000
1 2016 2 25.000000
2 2017 1 20.333333
Or you can still use lambda funcion
df.groupby('year', as_index=False)[['category','intensity']]\
.agg({'category': lambda x: x.value_counts().index[0],'intensity':'mean'})

Python Pandas: Create New Column With Calculations Based on Categorical Values in A Different Column

I have the following sample data frame:
id category time
43 S 8
22 I 10
15 T 350
18 L 46
I want to apply the following logic:
1) if category value equals "T" then create new column called "time_2" where "time" value is divided by 24.
2) if category value equals "L" then create new column called "time_2" where "time" value is divided by 3.5.
3) otherwise take existing "time" value from categories S or I
Below is my desired output table:
id category time time_2
43 S 8 8
22 I 10 10
15 T 350 14.58333333
18 L 46 13.14285714
I've tried using pd.np.where to get the above to work but am confused around syntax.

You can use map for rules
In [1066]: df['time_2'] = df.time / df.category.map({'T': 24, 'L': 3.5}).fillna(1)
In [1067]: df
Out[1067]:
id category time time_2
0 43 S 8 8.000000
1 22 I 10 10.000000
2 15 T 350 14.583333
3 18 L 46 13.142857

You can use np.select. This is a good alternative to nested np.where logic.
conditions = [df['category'] == 'T', df['category'] == 'L']
values = [df['time'] / 24, df['time'] / 3.5]
df['time_2'] = np.select(conditions, values, df['time'])
print(df)
id category time time_2
0 43 S 8 8.000000
1 22 I 10 10.000000
2 15 T 350 14.583333
3 18 L 46 13.142857

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

weighted average aggregation on multiple columns of df - python

Related

Pandas Groupby and remove rows with numbers that are close to each other

Pandas DataFrame: Calculate percentage difference between rows?

groupby().mean() don't work under for loop

How to calculate average and most frequent values per group?

Python Pandas: Create New Column With Calculations Based on Categorical Values in A Different Column

Categories

Resources