Pandas, group dataframe and normalize values in each group - python

I have a csv file with different groups identified by an ID, something like:
ID,X
aaa,3
aaa,5
aaa,4
bbb,50
bbb,54
bbb,52
I need to:
calculate the mean of x in each group;
divide each value of x by the mean of x for that specific group.
So, in my example above, the mean in the 'aaa' group is 4, while in 'bbb' it's 52.
I need to obtain a new dataframe with a third column, where in each row I have the original value of x divided by the group average:
ID,X,x/group_mean
aaa,3,3/4
aaa,5,5/4
aaa,4,4/4
bbb,50,50/52
bbb,54,54/52
bbb,52,52/52
I can group the dataframe and calculate each group's mean with:
df_data = pd.read_csv('test.csv', index_col=0)
df_grouped = df_data.groupby('ID')
for group_name, group_content in df_grouped:
    mean_x_group = group_content['X'].mean()
    print(f'mean = {mean_x_group}')
but how do I add the third column?

Use GroupBy.transform:
In [1874]: df['mean'] = df.groupby('ID').transform('mean')
In [1879]: df['newcol'] = df.X.div(df['mean'])
In [1880]: df
Out[1880]:
ID X mean newcol
0 aaa 3 4 0.750000
1 aaa 5 4 1.250000
2 aaa 4 4 1.000000
3 bbb 50 52 0.961538
4 bbb 54 52 1.038462
5 bbb 52 52 1.000000
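If the frame has more columns than just X, it can be safer to be explicit about which column is transformed; a minimal sketch of that variant, using the same df, ID and X names as above:
# same idea, but selecting the column explicitly before transform (sketch)
df['newcol'] = df['X'] / df.groupby('ID')['X'].transform('mean')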

The same idea as a neat one-liner (note that this recomputes the group mean for every row, so it can be slow on large frames):
df['new_column'] = df.apply(lambda row: row.X/df.loc[df.ID==row.ID, 'X'].mean(), axis=1)

A one-liner to do that:
# divide X by the group mean of X (grouped by ID); the result is the ratio, not the mean
df['x_div_group_mean'] = df.X / df.groupby('ID').transform('mean').X

Related

Pandas in Python: how to exclude results with a count == 1?

Here's the code I currently have:
df.groupby(df['LOCAL_COUNCIL']).agg({'CRIME_RATING': ['mean', 'count']}).reset_index()
which returns something like the following (I've made these values up):
CRIME_RATING
mean count
0 3.000000 1
1 3.118397 39
2 2.790698 32
3 5.125000 18
4 4.000000 1
5 4.222222 22
but I'd quite like to exclude indexes 0 and 4 from the resulting dataframe given that they both have a count of 1. Can this be done?
Use Series.ne to filter values not equal to 1, selecting the MultiIndex column with a tuple, and filter with boolean indexing:
df1 = df.groupby(df['LOCAL_COUNCIL']).agg({'CRIME_RATING': ['mean', 'count']}).reset_index()
df2 = df1[df1[('CRIME_RATING','count')].ne(1)]
If you want to avoid the MultiIndex, use named aggregation:
df1 = df.groupby(df['LOCAL_COUNCIL']).agg(mean=('CRIME_RATING', 'mean'),
                                          count=('CRIME_RATING', 'count'))
df2 = df1[df1['count'].ne(1)]
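For a self-contained illustration, here is a minimal sketch with toy data (the values below are assumed, not the asker's real dataset):
import pandas as pd

df = pd.DataFrame({'LOCAL_COUNCIL': ['a', 'b', 'b', 'c', 'c', 'c'],
                   'CRIME_RATING': [3, 4, 2, 5, 1, 3]})

# named aggregation, then drop groups whose count equals 1
df1 = df.groupby('LOCAL_COUNCIL').agg(mean=('CRIME_RATING', 'mean'),
                                      count=('CRIME_RATING', 'count'))
df2 = df1[df1['count'].ne(1)]
print(df2)   # council 'a' is excluded because its count is 1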

Drop a group of rows if one column has missing data in a pandas dataframe

I have the following dataframe:
df
Group Dist
0 A 5
1 B 2
2 A 3
3 B 1
4 B 0
5 A 5
I am trying to drop all rows that match Group if the Dist column equals zero. This works to delete row 4:
df = df[df.Dist != 0]
however I also want to delete rows 1 and 3 so I am left with:
df
Group Dist
0 A 5
2 A 3
5 A 5
Any ideas on how to drop the group based off this condition?
Thanks!
First get all Group values where Dist == 0, then filter them out by checking the Group column against that set with isin and inverting the mask with ~:
df1 = df[~df['Group'].isin(df.loc[df.Dist == 0, 'Group'])]
print (df1)
Group Dist
0 A 5
2 A 3
5 A 5
Or you can use GroupBy.transform with 'all' to test whether a group contains no 0 values:
df1 = df[(df.Dist != 0).groupby(df['Group']).transform('all')]
EDIT: To remove all groups with missing values:
df2 = df[df['Dist'].notna().groupby(df['Group']).transform('all')]
To test for missing values:
print (df[df['Dist'].isna()])
If it returns nothing, there are no missing values (neither NaN nor None).
You can also inspect a single scalar, e.g. the value in the row with index 10:
print (df.loc[10, 'Dist'])
print (type(df.loc[10, 'Dist']))
You can use groupby and the method filter:
df.groupby('Group').filter(lambda x: x['Dist'].ne(0).all())
Output:
Group Dist
0 A 5
2 A 3
5 A 5
If you want to filter out groups with missing values:
df.groupby('Group').filter(lambda x: x['Dist'].notna().all())
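As a quick sketch of the missing-value case (a hypothetical extra group 'C', whose only row has a missing Dist, is assumed here):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'B', 'A', 'B', 'B', 'A', 'C'],
                   'Dist': [5, 2, 3, 1, 0, 5, np.nan]})

# group 'C' is dropped entirely because it contains a missing Dist
print(df.groupby('Group').filter(lambda x: x['Dist'].notna().all()))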

weighted average in pandas pivot_table

I'm trying to generate pandas pivot table that calculates an average of values in a series of data columns weighted by the values in a fixed weights column, and am struggling to find an elegant and efficient way to do this.
df = pd.DataFrame([['A',10,1],['A',20,0],['B',10,1],['B',0,0]],columns=['Group','wt','val'])
Group wt val
0 A 10 1
1 A 20 0
2 B 10 1
3 B 0 0
I want to group by Group and return both a new weight (sum of df.wt -- easy peasy) and an average of df.val weighted by df.wt to yield this:
Group weight val
0 A 30 0.333
1 B 10 1.000
In the real application there are a large number of val columns and one weight column, along with other columns that I want to apply different aggfuncs to. So while I realize I could do this by direct application of groupby, it's messier. Is there a way to roll my own aggfunc within pivot_table that would compute a weighted average?
Here's an approach with groupby:
(df.assign(total=df.wt*df.val)
   .groupby('Group', as_index=False)
   .sum()
   .assign(val=lambda x: x['total']/x['wt'])
   .drop('total', axis=1)
)
Output:
Group wt val
0 A 30 0.333333
1 B 10 1.000000
Update: for all val-like columns:
# toy data
df = pd.DataFrame([['A',10,1,1],['A',20,0,1],['B',10,1,2],['B',0,0,1]],
                  columns=['Group','wt','val_a','val_b'])
# grouped weighted sums
new_df = (df.filter(like='val')   # select the val columns
            .mul(df.wt, axis=0)   # multiply by the weights
            .assign(wt=df.wt)     # attach the weight column
            .groupby(df.Group).sum()
         )
# loop over columns and divide by the weight sum
new_df.apply(lambda x: x/new_df['wt'] if x.name != 'wt' else x)
Output:
val_a val_b wt
Group
A 0.333333 1.0 30
B 1.000000 2.0 10
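If you prefer to avoid the apply loop, the same division can be done in one vectorized step; a small sketch, assuming the new_df built above:
# divide every val column by the summed weights, then re-attach wt
result = new_df.filter(like='val').div(new_df['wt'], axis=0).join(new_df['wt'])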
This should work for multiple numerical columns:
Create a function that uses numpy average, with weights included.
Run a list comprehension on the groups in the groupby, and apply the function
Concatenate the output
df = pd.DataFrame([['A',10,1,2],['A',20,0,3],['B',10,1,2],['B',0,0,3]],columns=['Group','wt','val','vala'])
Group wt val vala
0 A 10 1 2
1 A 20 0 3
2 B 10 1 2
3 B 0 0 3
import numpy as np

# create function
def avg(group):
    df = pd.DataFrame()
    for col in group.columns.drop(['Group','wt']):
        A = group[col]
        B = group['wt']
        df['Group'] = group['Group'].unique()
        df['wt'] = B.sum()
        df[col] = np.average(A, weights=B)
    return df
#pipe function to the group in the list comprehension
output = [group.pipe(avg) for name, group in df.groupby('Group')]
#concatenate dataframes
pd.concat(output,ignore_index=True)
Group wt val vala
0 A 30 0.333333 2.666667
1 B 10 1.000000 2.000000
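A more compact equivalent, offered only as a sketch (it leans on groupby.apply, which can be slow when there are many groups), using the same df, Group and wt names as above:
# weighted mean of every non-key, non-weight column per group, plus the summed weight
out = df.groupby('Group').apply(
    lambda g: pd.Series({'wt': g['wt'].sum(),
                         **{col: np.average(g[col], weights=g['wt'])
                            for col in g.columns.drop(['Group', 'wt'])}})
).reset_index()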

Pandas how to aggregate more than one column

Here is the snippet:
test = pd.DataFrame({'userid': [1,1,1,2,2], 'order_id': [1,2,3,4,5], 'fee': [2,1,5,3,1]})
I'd like to group based on userid and count the 'order_id' column and sum the 'fee' column:
test.groupby('userid').order_id.count()
test.groupby('userid').fee.sum()
Is it possible to perform these two operations in one line of code, so that I get a resulting df that looks like this:
userid counts sum
...
I've tried pivot_table:
test.pivot_table(index='userid', values=['order_id', 'fee'], aggfunc=[np.size, np.sum])
It gives something like this:
size sum
fee order_id fee order_id
userid
1 3 3 8 6
2 2 2 4 9
Is it possible to tell pandas to use np.size & np.sum on one column but not both?
Use DataFrameGroupBy.agg and rename the columns:
d = {'order_id':'counts','fee':'sum'}
df = (test.groupby('userid')
          .agg({'order_id':'count', 'fee':'sum'})
          .rename(columns=d)
          .reset_index())
print (df)
print (df)
userid sum counts
0 1 8 3
1 2 4 2
But it is better to aggregate with size; count should be used only when you need to exclude NaNs:
df = (test.groupby('userid')
          .agg({'order_id':'size', 'fee':'sum'})
          .rename(columns=d)
          .reset_index())
print (df)
print (df)
userid sum counts
0 1 8 3
1 2 4 2
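With pandas 0.25+ you can also use named aggregation to get the requested column names directly; a minimal sketch:
# one pass, with the output columns named exactly as asked for
df = test.groupby('userid').agg(counts=('order_id', 'size'),
                                sum=('fee', 'sum')).reset_index()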

pandas: append new column of row subtotals

This is very similar to this question, except I want my code to be able to apply to the length of a dataframe, instead of specific columns.
I have a DataFrame, and I'm trying to get a sum of each row to append to the dataframe as a column.
df = pd.DataFrame([[1,0,0],[20,7,1],[63,13,5]],columns=['drinking','drugs','both'],index = ['First','Second','Third'])
drinking drugs both
First 1 0 0
Second 20 7 1
Third 63 13 5
Desired output:
drinking drugs both total
First 1 0 0 1
Second 20 7 1 28
Third 63 13 5 81
Current code:
df['total'] = df.apply(lambda row: (row['drinking'] + row['drugs'] + row['both']),axis=1)
This works great. But what if I have another dataframe, with seven columns, which are not called 'drinking', 'drugs', or 'both'? Is it possible to adjust this function so that it applies to the length of the dataframe? That way I can use the function for any dataframe at all, with a varying number of columns, not just a dataframe with columns called 'drinking', 'drugs', and 'both'?
Something like:
df['total'] = df.apply(for col in df: [code to calculate sum of each row]),axis=1)
You can use sum:
df['total'] = df.sum(axis=1)
If you need to sum only some columns, use a subset:
df['total'] = df[['drinking', 'drugs', 'both']].sum(axis=1)
What about something like this:
df.loc[:, 'Total'] = df.sum(axis=1)
with the output :
Out[4]:
drinking drugs both Total
First 1 0 0 1
Second 20 7 1 28
Third 63 13 5 81
It will sum all columns by row.
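Because df.sum(axis=1) automatically uses every numeric column, the same line works for a frame with any number of columns and any column names; a quick sketch with a made-up seven-column frame:
import numpy as np
import pandas as pd

# hypothetical frame with seven arbitrary columns
df7 = pd.DataFrame(np.arange(21).reshape(3, 7), columns=list('abcdefg'))
df7['total'] = df7.sum(axis=1)
print(df7)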
