I'm trying to generate a pandas pivot table that calculates an average of the values in a series of data columns, weighted by the values in a fixed weights column, and I am struggling to find an elegant and efficient way to do this.
df = pd.DataFrame([['A',10,1],['A',20,0],['B',10,1],['B',0,0]],columns=['Group','wt','val'])
Group wt val
0 A 10 1
1 A 20 0
2 B 10 1
3 B 0 0
I want to group by Group and return both a new weight (sum of df.wt -- easy peasy) and an average of df.val weighted by df.wt to yield this:
Group weight val
0 A 30 0.333
1 B 10 1.000
In the real application there are a large number of val columns and one weight column, along with other columns that I want to apply different aggfuncs to. So while I realize I could do this by direct application of groupby, it's messier. Is there a way to roll my own aggfunc within pivot_table that would compute a weighted average?
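One sketch of exactly that (not from the answers below; wavg is a hypothetical helper name): a pivot_table aggfunc only receives one column's values per cell, but the Series it gets keeps its original index into df, so the matching weights can be looked up with .loc:

import pandas as pd

df = pd.DataFrame([['A',10,1],['A',20,0],['B',10,1],['B',0,0]],
                  columns=['Group','wt','val'])

# the Series passed to aggfunc keeps its index into df,
# so the weights for the same rows can be fetched with .loc
def wavg(s):
    w = df.loc[s.index, 'wt']
    return (s * w).sum() / w.sum()   # assumes no group has all-zero weights

df.pivot_table(index='Group', values=['val', 'wt'],
               aggfunc={'val': wavg, 'wt': 'sum'})

This mixes the custom aggfunc with a plain 'sum' for the weight column, which is the "different aggfuncs per column" shape the question describes.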
Here's an approach with groupby:
(df.assign(total=df.wt*df.val)                # weight * value per row
   .groupby('Group', as_index=False)
   .sum()                                     # sums wt and total per group
   .assign(val=lambda x: x['total']/x['wt'])  # weighted average
   .drop('total', axis=1)
)
Output:
Group wt val
0 A 30 0.333333
1 B 10 1.000000
Update: for all val-like columns:
# toy data
df = pd.DataFrame([['A',10,1,1],['A',20,0,1],['B',10,1,2],['B',0,0,1]],
                  columns=['Group','wt','val_a', 'val_b'])
# grouping sum
new_df = (df.filter(like='val')    # select the val columns
            .mul(df.wt, axis=0)    # multiply by the weights
            .assign(wt=df.wt)      # attach the weight column
            .groupby(df.Group).sum()
)
# loop over the columns and divide each val column by the weight sum
new_df.apply(lambda x: x/new_df['wt'] if x.name != 'wt' else x)
Output:
val_a val_b wt
Group
A 0.333333 1.0 30
B 1.000000 2.0 10
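If you prefer to avoid the apply at the end, the same division can be done in one vectorized step (a sketch building on new_df above):

# divide every val column by the summed weights, then re-attach them
result = (new_df.filter(like='val')
                .div(new_df['wt'], axis=0)
                .assign(wt=new_df['wt']))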
This should work for multiple numerical columns:
1. Create a function that uses numpy.average with weights included.
2. Run a list comprehension on the groups in the groupby, and apply the function.
3. Concatenate the output.
df = pd.DataFrame([['A',10,1,2],['A',20,0,3],['B',10,1,2],['B',0,0,3]],columns=['Group','wt','val','vala'])
Group wt val vala
0 A 10 1 2
1 A 20 0 3
2 B 10 1 2
3 B 0 0 3
#create a function that computes the weighted average per group
import numpy as np

def avg(group):
    df = pd.DataFrame()
    for col in group.columns.drop(['Group','wt']):
        A = group[col]     # values to average
        B = group['wt']    # weights
        df['Group'] = group['Group'].unique()
        df['wt'] = B.sum()
        df[col] = np.average(A, weights=B)
    return df
#pipe function to the group in the list comprehension
output = [group.pipe(avg) for name, group in df.groupby('Group')]
#concatenate dataframes
pd.concat(output,ignore_index=True)
Group wt val vala
0 A 30 0.333333 2.666667
1 B 10 1.000000 2.000000
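If you'd rather skip the explicit list comprehension, the same np.average idea fits into a single groupby.apply (a sketch; it builds one row per group as a Series):

import numpy as np
import pandas as pd

out = (df.groupby('Group')
         .apply(lambda g: pd.Series(
             {'wt': g['wt'].sum(),
              **{c: np.average(g[c], weights=g['wt'])
                 for c in g.columns.drop(['Group', 'wt'])}}))
         .reset_index())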
I would like to collapse my dataset using groupby and agg; however, after collapsing, I want the new column to show a string value only for the grouped rows.
For example, the initial data is:
df = pd.DataFrame([["a",1],["a",2],["b",2]], columns=['category','value'])
category value
0 a 1
1 a 3
2 b 2
Desired output:
category value
0 a grouped
1 b 2
How should I modify my code (to show "grouped" instead of 3):
df=df.groupby(['category'], as_index=False).agg({'value':'max'})
You can use a lambda with a ternary (taking the single value with iloc so the function always returns a scalar):
(df.groupby("category", as_index=False)
   .agg({"value": lambda x: "grouped" if len(x) > 1 else x.iloc[0]}))
This outputs:
category value
0 a grouped
1 b 2
Another possible solution, marking the rows whose category is duplicated and then dropping the duplicates:
import numpy as np

(df.assign(value = np.where(
        df.duplicated(subset=['category'], keep=False), 'grouped', df['value']))
   .drop_duplicates())
Output:
category value
0 a grouped
2 b 2
I have a DataFrame.
a b
0 0.5 1
1 2#3 4
2 1 4#4
I want to check every value in each column for the presence of "#". If it is present, I need to split the value, average the parts, and replace the cell with the new value. Values that do not contain "#" should not go through any of these operations. My result should be:
a b
0 0.5 1
1 2.5 4
2 1 4
For example, 2#3 needs to be split into 2 and 3, and the average of these values (2.5) taken.
You could use stack + str.split + explode + astype(float) to create a MultiIndex Series of dtype float out of df, then groupby the index, take the mean, and unstack to rebuild the DataFrame (the astype(str) up front guards against cells that are already numeric):
out = (df.astype(str).stack().str.split('#').explode().astype(float)
         .groupby(level=[0,1]).mean().unstack())
Output:
a b
0 0.5 1.0
1 2.5 4.0
2 1.0 4.0
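An equivalent per-cell route (a sketch; parse_cell is just an illustrative helper): map every cell through a small parser that splits on '#' and averages the parts, so plain values come back unchanged apart from the float conversion:

import numpy as np

# illustrative helper: '2#3' -> 2.5, '0.5' -> 0.5
def parse_cell(v):
    return np.mean([float(p) for p in str(v).split('#')])

out = df.apply(lambda s: s.map(parse_cell))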
Although it is not an optimal solution, one way could be as follows:
import pandas as pd

for col in df.columns:
    for idx, i in enumerate(df[col]):
        if '#' in str(i):
            # split on '#', convert the parts, and average them
            temp = [float(j) for j in str(i).split('#')]
            avg = sum(temp) / len(temp)
            df.loc[idx, col] = avg
print(df)
Here's the code I currently have:
df.groupby(df['LOCAL_COUNCIL']).agg({'CRIME_RATING': ['mean', 'count']}).reset_index()
which returns something like the following (I've made these values up):
CRIME_RATING
mean count
0 3.000000 1
1 3.118397 39
2 2.790698 32
3 5.125000 18
4 4.000000 1
5 4.222222 22
but I'd quite like to exclude indexes 0 and 4 from the resulting dataframe given that they both have a count of 1. Can this be done?
Use Series.ne to keep the counts not equal to 1, selecting the count column of the MultiIndex with a tuple, and filter with boolean indexing:
df1 = df.groupby(df['LOCAL_COUNCIL']).agg({'CRIME_RATING': ['mean', 'count']}).reset_index()
df2 = df1[df1[('CRIME_RATING','count')].ne(1)]
If you want to avoid the MultiIndex, use named aggregation:
df1 = df.groupby(df['LOCAL_COUNCIL']).agg(mean=('CRIME_RATING','mean'),
                                          count=('CRIME_RATING','count'))
df2 = df1[df1['count'].ne(1)]
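Alternatively (a sketch), GroupBy.filter can drop the size-1 groups before the aggregation, so the unwanted rows never reach the agg step:

df2 = (df.groupby('LOCAL_COUNCIL')
         .filter(lambda g: len(g) > 1)   # keep groups with 2+ rows
         .groupby('LOCAL_COUNCIL')
         .agg(mean=('CRIME_RATING', 'mean'),
              count=('CRIME_RATING', 'count'))
         .reset_index())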
I have a csv file with different groups identified by an ID, something like:
ID,X
aaa,3
aaa,5
aaa,4
bbb,50
bbb,54
bbb,52
I need to:
calculate the mean of x in each group;
divide each value of x by the mean of x for that specific group.
So, in my example above, the mean in the 'aaa' group is 4, while in 'bbb' it's 52.
I need to obtain a new dataframe with a third column, where in each row I have the original value of x divided by the group average:
ID,X,x/group_mean
aaa,3,3/4
aaa,5,5/4
aaa,4,4/4
bbb,50,50/52
bbb,54,54/52
bbb,52,52/52
I can group the dataframe and calculate each group's mean with:
df_data = pd.read_csv('test.csv', index_col=0)
df_grouped = df_data.groupby('ID')
for group_name, group_content in df_grouped:
    mean_x_group = group_content['X'].mean()
    print(f'mean = {mean_x_group}')
but how do I add the third column?
Use GroupBy.transform:
In [1874]: df['mean'] = df.groupby('ID')['X'].transform('mean')
In [1879]: df['newcol'] = df.X.div(df['mean'])
In [1880]: df
Out[1880]:
ID X mean newcol
0 aaa 3 4 0.750000
1 aaa 5 4 1.250000
2 aaa 4 4 1.000000
3 bbb 50 52 0.961538
4 bbb 54 52 1.038462
5 bbb 52 52 1.000000
The same idea as a neat one-liner with apply (note that this recomputes the group mean for every row, so it is slower than transform on large frames):
df['new_column'] = df.apply(lambda row: row.X/df.loc[df.ID==row.ID, 'X'].mean(), axis=1)
A one-liner to do the same:
# divide X by the per-ID group mean of X
df['group_mean'] = df.X / df.groupby('ID').transform('mean').X
I have a dict of pandas DataFrames, d1, where each value is a two-column (ID and Weight), 100-row DataFrame.
I want to iterate through the dict and, for each DataFrame, sum all the 'Weight' values in row n, where n is the value between 1 and 100 identifying the row. I then want to write the output to another dict, d2, where the key is 1-100 and the value is the sum of the Weight values.
Example d1 value dataframe:
ID Weight
1 0.021
2 0.445
3 1.018
..
..
..
99 77.31
100 234.04
Essentially, imagine I have 10000 of these dataframes, and I want to sum all the Weight values for ID 1 across the 10000, then all the Weight values for ID 2 across the 10000, and so on up to ID 100.
I have a solution, which is basically a nested loop. It works, and it will do. However, I'm really keen to expand my basic pandas / numpy knowledge, and I wondered if there is a more pythonic way to do this?
My existing code :
for i in range(1, 101):
    tot = 0
    for key, value in d1.items():
        tot = tot + value.at[i, 'Weight']
    d2[i] = tot
Hugely appreciate any help and advice!
You can use the pandas add function:
import numpy as np
import pandas as pd

#create a zero-filled dataframe matching the shape of the dict's frames
df = pd.DataFrame(0, index=np.arange(len(df1)), columns=df1.columns)
#iterate through the dict and add each DataFrame's values to df
for value in d1.values():
    df = df.add(value)
You can set ID as the index via df_i = df_i.set_index('ID') and then add the frames up, so that only the weights are summed; call df = df.reset_index() at the end. Without set_index the ID column gets summed too, as the example below shows:
Example:
df1 = pd.DataFrame([(1,2),(3,4),(5,6)], columns=['ID','Weight'])
ID Weight
0 1 2
1 3 4
2 5 6
df2 = pd.DataFrame([(10,20),(30,40),(50,60)], columns=['ID','Weight'])
ID Weight
0 10 20
1 30 40
2 50 60
df3 = pd.DataFrame([(100,200),(300,400),(500,600)], columns=['ID','Weight'])
ID Weight
0 100 200
1 300 400
2 500 600
d1 = {'df1':df1,'df2':df2,'df3':df3}
df = pd.DataFrame(0, index=np.arange(len(df1)), columns=df1.columns)
print(df)
for value in d1.values():
    df = df.add(value)
df:
ID Weight
0 111 222
1 333 444
2 555 666
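For the record, the whole loop can collapse into a single concat + groupby (a sketch, assuming every DataFrame in d1 has the ID and Weight columns): stack the frames and let groupby do the per-ID summation; to_dict then gives d2 directly.

import pandas as pd

# stack every DataFrame in d1 and sum Weight per ID in one pass
totals = pd.concat(d1.values()).groupby('ID')['Weight'].sum()

# d2 keyed by ID, value = summed weight
d2 = totals.to_dict()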