I'm trying to generate a pandas pivot table that calculates an average of the values in a series of data columns, weighted by the values in a fixed weights column, and I am struggling to find an elegant and efficient way to do this.
df = pd.DataFrame([['A',10,1],['A',20,0],['B',10,1],['B',0,0]],columns=['Group','wt','val'])
Group wt val
0 A 10 1
1 A 20 0
2 B 10 1
3 B 0 0
I want to group by Group and return both a new weight (sum of df.wt -- easy peasy) and an average of df.val weighted by df.wt to yield this:
Group weight val
0 A 30 0.333
1 B 10 1.000
In the real application there are a large number of val columns and one weight column, along with other columns that I want to apply different aggfuncs to. So while I realize I could do this by direct application of groupby, it's messier. Is there a way to roll my own aggfunc within pivot_table that would compute a weighted average?
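One sketch of exactly that (not from the answers below; wavg is a hypothetical helper name): a pivot_table aggfunc only receives one column's values per cell, but the Series it gets keeps its original index into df, so the matching weights can be looked up with .loc:

import pandas as pd

df = pd.DataFrame([['A',10,1],['A',20,0],['B',10,1],['B',0,0]],
                  columns=['Group','wt','val'])

# the Series passed to aggfunc keeps its index into df,
# so the weights for the same rows can be fetched with .loc
def wavg(s):
    w = df.loc[s.index, 'wt']
    return (s * w).sum() / w.sum()   # assumes no group has all-zero weights

df.pivot_table(index='Group', values=['val', 'wt'],
               aggfunc={'val': wavg, 'wt': 'sum'})

This mixes the custom aggfunc with a plain 'sum' for the weight column, which is the "different aggfuncs per column" shape the question describes.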
Here's an approach with groupby:
(df.assign(total=df.wt*df.val)                # weight * value per row
   .groupby('Group', as_index=False)
   .sum()                                     # sums wt and total per group
   .assign(val=lambda x: x['total']/x['wt'])  # weighted average
   .drop('total', axis=1)
)
Output:
Group wt val
0 A 30 0.333333
1 B 10 1.000000
Update: for all val-like columns:
# toy data
df = pd.DataFrame([['A',10,1,1],['A',20,0,1],['B',10,1,2],['B',0,0,1]],
                  columns=['Group','wt','val_a', 'val_b'])
# grouping sum
new_df = (df.filter(like='val')    # select the val columns
            .mul(df.wt, axis=0)    # multiply by the weights
            .assign(wt=df.wt)      # attach the weight column
            .groupby(df.Group).sum()
)
# loop over the columns and divide each val column by the weight sum
new_df.apply(lambda x: x/new_df['wt'] if x.name != 'wt' else x)
Output:
val_a val_b wt
Group
A 0.333333 1.0 30
B 1.000000 2.0 10
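If you prefer to avoid the apply at the end, the same division can be done in one vectorized step (a sketch building on new_df above):

# divide every val column by the summed weights, then re-attach them
result = (new_df.filter(like='val')
                .div(new_df['wt'], axis=0)
                .assign(wt=new_df['wt']))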
This should work for multiple numerical columns:
1. Create a function that uses numpy.average with weights included.
2. Run a list comprehension on the groups in the groupby, and apply the function.
3. Concatenate the output.
df = pd.DataFrame([['A',10,1,2],['A',20,0,3],['B',10,1,2],['B',0,0,3]],columns=['Group','wt','val','vala'])
Group wt val vala
0 A 10 1 2
1 A 20 0 3
2 B 10 1 2
3 B 0 0 3
#create a function that computes the weighted average per group
import numpy as np

def avg(group):
    df = pd.DataFrame()
    for col in group.columns.drop(['Group','wt']):
        A = group[col]     # values to average
        B = group['wt']    # weights
        df['Group'] = group['Group'].unique()
        df['wt'] = B.sum()
        df[col] = np.average(A, weights=B)
    return df
#pipe function to the group in the list comprehension
output = [group.pipe(avg) for name, group in df.groupby('Group')]
#concatenate dataframes
pd.concat(output,ignore_index=True)
Group wt val vala
0 A 30 0.333333 2.666667
1 B 10 1.000000 2.000000
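If you'd rather skip the explicit list comprehension, the same np.average idea fits into a single groupby.apply (a sketch; it builds one row per group as a Series):

import numpy as np
import pandas as pd

out = (df.groupby('Group')
         .apply(lambda g: pd.Series(
             {'wt': g['wt'].sum(),
              **{c: np.average(g[c], weights=g['wt'])
                 for c in g.columns.drop(['Group', 'wt'])}}))
         .reset_index())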
I would like to collapse my dataset using groupby and agg; however, after collapsing, I want the new column to show a string value only for the grouped rows.
For example, the initial data is:
df = pd.DataFrame([["a",1],["a",2],["b",2]], columns=['category','value'])
category value
0 a 1
1 a 3
2 b 2
Desired output:
category value
0 a grouped
1 b 2
How should I modify my code (to show "grouped" instead of 3):
df=df.groupby(['category'], as_index=False).agg({'value':'max'})
You can use a lambda with a ternary (taking the single value with iloc so the function always returns a scalar):
(df.groupby("category", as_index=False)
   .agg({"value": lambda x: "grouped" if len(x) > 1 else x.iloc[0]}))
This outputs:
category value
0 a grouped
1 b 2
Another possible solution, marking the rows whose category is duplicated and then dropping the duplicates:
import numpy as np

(df.assign(value = np.where(
        df.duplicated(subset=['category'], keep=False), 'grouped', df['value']))
   .drop_duplicates())
Output:
category value
0 a grouped
2 b 2
I have a DataFrame.
a b
0 0.5 1
1 2#3 4
2 1 4#4
I want to check every value in each column for the presence of "#". If it is present, I need to split the value, average the parts, and replace the cell with the new value. Values that do not contain "#" should not go through any of these operations. My result should be:
a b
0 0.5 1
1 2.5 4
2 1 4
For example, 2#3 needs to be split into 2 and 3, and the average of these values (2.5) taken.
You could use stack + str.split + explode + astype(float) to create a MultiIndex Series of dtype float out of df, then groupby the index, take the mean, and unstack to rebuild the DataFrame (the astype(str) up front guards against cells that are already numeric):
out = (df.astype(str).stack().str.split('#').explode().astype(float)
         .groupby(level=[0,1]).mean().unstack())
Output:
a b
0 0.5 1.0
1 2.5 4.0
2 1.0 4.0
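An equivalent per-cell route (a sketch; parse_cell is just an illustrative helper): map every cell through a small parser that splits on '#' and averages the parts, so plain values come back unchanged apart from the float conversion:

import numpy as np

# illustrative helper: '2#3' -> 2.5, '0.5' -> 0.5
def parse_cell(v):
    return np.mean([float(p) for p in str(v).split('#')])

out = df.apply(lambda s: s.map(parse_cell))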
Although it is not an optimal solution, one way could be as follows:
import pandas as pd

for col in df.columns:
    for idx, i in enumerate(df[col]):
        if '#' in str(i):
            # split on '#', convert the parts, and average them
            temp = [float(j) for j in str(i).split('#')]
            avg = sum(temp) / len(temp)
            df.loc[idx, col] = avg
print(df)
Here's the code I currently have:
df.groupby(df['LOCAL_COUNCIL']).agg({'CRIME_RATING': ['mean', 'count']}).reset_index()
which returns something like the following (I've made these values up):
CRIME_RATING
mean count
0 3.000000 1
1 3.118397 39
2 2.790698 32
3 5.125000 18
4 4.000000 1
5 4.222222 22
but I'd quite like to exclude indexes 0 and 4 from the resulting dataframe given that they both have a count of 1. Can this be done?
Use Series.ne to keep the counts not equal to 1, selecting the count column of the MultiIndex with a tuple, and filter with boolean indexing:
df1 = df.groupby(df['LOCAL_COUNCIL']).agg({'CRIME_RATING': ['mean', 'count']}).reset_index()
df2 = df1[df1[('CRIME_RATING','count')].ne(1)]
If you want to avoid the MultiIndex, use named aggregation:
df1 = df.groupby(df['LOCAL_COUNCIL']).agg(mean=('CRIME_RATING','mean'),
                                          count=('CRIME_RATING','count'))
df2 = df1[df1['count'].ne(1)]
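Alternatively (a sketch), GroupBy.filter can drop the size-1 groups before the aggregation, so the unwanted rows never reach the agg step:

df2 = (df.groupby('LOCAL_COUNCIL')
         .filter(lambda g: len(g) > 1)   # keep groups with 2+ rows
         .groupby('LOCAL_COUNCIL')
         .agg(mean=('CRIME_RATING', 'mean'),
              count=('CRIME_RATING', 'count'))
         .reset_index())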
I have a csv file with different groups identified by an ID, something like:
ID,X
aaa,3
aaa,5
aaa,4
bbb,50
bbb,54
bbb,52
I need to:
calculate the mean of x in each group;
divide each value of x by the mean of x for that specific group.
So, in my example above, the mean in the 'aaa' group is 4, while in 'bbb' it's 52.
I need to obtain a new dataframe with a third column, where in each row I have the original value of x divided by the group average:
ID,X,x/group_mean
aaa,3,3/4
aaa,5,5/4
aaa,4,4/4
bbb,50,50/52
bbb,54,54/52
bbb,52,52/52
I can group the dataframe and calculate each group's mean with:
df_data = pd.read_csv('test.csv', index_col=0)
df_grouped = df_data.groupby('ID')
for group_name, group_content in df_grouped:
    mean_x_group = group_content['X'].mean()
    print(f'mean = {mean_x_group}')
but how do I add the third column?
Use GroupBy.transform:
In [1874]: df['mean'] = df.groupby('ID')['X'].transform('mean')
In [1879]: df['newcol'] = df.X.div(df['mean'])
In [1880]: df
Out[1880]:
ID X mean newcol
0 aaa 3 4 0.750000
1 aaa 5 4 1.250000
2 aaa 4 4 1.000000
3 bbb 50 52 0.961538
4 bbb 54 52 1.038462
5 bbb 52 52 1.000000
The same idea as a neat one-liner with apply (note that this recomputes the group mean for every row, so it is slower than transform on large frames):
df['new_column'] = df.apply(lambda row: row.X/df.loc[df.ID==row.ID, 'X'].mean(), axis=1)
A one-liner to do the same:
# divide X by the per-ID group mean of X
df['group_mean'] = df.X / df.groupby('ID').transform('mean').X
I have a dict of pandas DataFrames, d1, where each value is a two-column (ID and Weight), 100-row DataFrame.
I want to iterate through the dict and, for each DataFrame, sum all the 'Weight' values in row n, where n is the value between 1 and 100 identifying the row. I then want to write the output to another dict, d2, where the key is 1-100 and the value is the sum of the Weight values.
Example d1 value dataframe:
ID Weight
1 0.021
2 0.445
3 1.018
..
..
..
99 77.31
100 234.04
Essentially, imagine I have 10000 of these dataframes, and I want to sum all the Weight values for ID 1 across the 10000, then all the Weight values for ID 2 across the 10000, and so on up to ID 100.
I have a solution, which is basically a nested loop. It works, and it will do. However, I'm really keen to expand my basic pandas / numpy knowledge, and I wondered if there is a more pythonic way to do this?
My existing code :
for i in range(1, 101):
    tot = 0
    for key, value in d1.items():
        tot = tot + value.at[i, 'Weight']
    d2[i] = tot
Hugely appreciate any help and advice!
You can use the pandas add function:
import numpy as np
import pandas as pd

#create a zero-filled dataframe matching the shape of the dict's frames
df = pd.DataFrame(0, index=np.arange(len(df1)), columns=df1.columns)
#iterate through the dict and add each DataFrame's values to df
for value in d1.values():
    df = df.add(value)
You can set ID as the index via df_i = df_i.set_index('ID') and then add the frames up, so that only the weights are summed; call df = df.reset_index() at the end. Without set_index the ID column gets summed too, as the example below shows:
Example:
df1 = pd.DataFrame([(1,2),(3,4),(5,6)], columns=['ID','Weight'])
ID Weight
0 1 2
1 3 4
2 5 6
df2 = pd.DataFrame([(10,20),(30,40),(50,60)], columns=['ID','Weight'])
ID Weight
0 10 20
1 30 40
2 50 60
df3 = pd.DataFrame([(100,200),(300,400),(500,600)], columns=['ID','Weight'])
ID Weight
0 100 200
1 300 400
2 500 600
d1 = {'df1':df1,'df2':df2,'df3':df3}
df = pd.DataFrame(0, index=np.arange(len(df1)), columns=df1.columns)
print(df)
for value in d1.values():
    df = df.add(value)
df:
ID Weight
0 111 222
1 333 444
2 555 666
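For the record, the whole loop can collapse into a single concat + groupby (a sketch, assuming every DataFrame in d1 has the ID and Weight columns): stack the frames and let groupby do the per-ID summation; to_dict then gives d2 directly.

import pandas as pd

# stack every DataFrame in d1 and sum Weight per ID in one pass
totals = pd.concat(d1.values()).groupby('ID')['Weight'].sum()

# d2 keyed by ID, value = summed weight
d2 = totals.to_dict()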