I am trying to find the average monthly cost per user_id, but I am only able to get the average cost per user or the monthly cost per user.
Because I group by user and month, there is no way to get the average over the second groupby level (month) unless I transform the groupby output into something else.
This is my df:
import pandas as pd

df = pd.DataFrame({'id': pd.Series([1,1,1,1,2,2,2,2]),
                   'cost': pd.Series([10,20,30,40,50,60,70,80]),
                   'mth': pd.Series([3,3,4,5,3,4,4,5])})
cost id mth
0 10 1 3
1 20 1 3
2 30 1 4
3 40 1 5
4 50 2 3
5 60 2 4
6 70 2 4
7 80 2 5
I can get the monthly sum, but I want the average of the monthly sums for each user_id.
df.groupby(['id','mth'])['cost'].sum()
id mth
1 3 30
4 30
5 40
2 3 50
4 130
5 80
I want something like this:
id average_monthly
1 (30+30+40)/3
2 (50+130+80)/3
Resetting the index should work. Try this:
In [19]: df.groupby(['id', 'mth']).sum().reset_index().groupby('id').mean()
Out[19]:
mth cost
id
1 4.0 33.333333
2 4.0 86.666667
You can just drop mth if you want (e.g. with .drop(columns='mth')). The logic is that after the sum part, you have this:
In [20]: df.groupby(['id', 'mth']).sum()
Out[20]:
cost
id mth
1 3 30
4 30
5 40
2 3 50
4 130
5 80
Resetting the index at this point gives you one row per unique (id, month) pair.
In [21]: df.groupby(['id', 'mth']).sum().reset_index()
Out[21]:
id mth cost
0 1 3 30
1 1 4 30
2 1 5 40
3 2 3 50
4 2 4 130
5 2 5 80
It's just a matter of grouping it again, this time using mean instead of sum. This should give you the averages.
Let us know if this helps.
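As an aside, the reset_index round-trip is optional: you can group the summed Series directly on its id index level. A minimal sketch, assuming the same df as above:
# sum per (id, mth), then average those monthly sums within each id
avg_monthly = (df.groupby(['id', 'mth'])['cost'].sum()
                 .groupby(level='id')
                 .mean())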
The same pattern on an invoice dataset, for the average monthly Revenue per year (note that the second groupby must be on the outer key, here InvoiceYear, not on Revenue itself):
df_monthly_average = (
    df.groupby(["InvoiceMonth", "InvoiceYear"])["Revenue"]
    .sum()
    .reset_index()
    .groupby("InvoiceYear")["Revenue"]
    .mean()
    .reset_index()
)
My end goal is to sum, per id, all the minutes in column min from the initial row to the final row in column periods.
I have thousands of ids and not all of them have the same number of minutes between initial and final.
Periods are sorted in a "journey" fashion: each record represents a period of time of its id.
Pseudocode:
Iterate rows and sum all values in column "min"
if the run starts at periods == initial and ends at periods == final
Example with 2 ids:
id  periods     min
1   period_x     10
1   initial       2
1   progress      3
1   progress_1    4
1   final         5
2   period_y     10
2   period_z      2
2   initial       3
2   progress_1   20
2   final         3
Desired output:
id  periods     min  sum
1   period_x     10   14
1   initial       2   14
1   progress      3   14
1   progress_1    4   14
1   final         5   14
2   period_y     10   26
2   period_z      2   26
2   initial       3   26
2   progress_1   20   26
2   final         3   26
So far I've tried:
L = ['initial', 'final']
df['sum'] = df['min'].where(df['periods'].isin(L)).groupby(df['id']).transform('sum')
But this only counts the initial and final rows themselves, not what is in between.
Create groups using cumsum, return the sum of group 1, then apply that sum to the entire column. "Group 1" is, per id, anything between initial and final:
import numpy as np

# flag the boundary rows, then cumulative-sum the flags within each id:
# rows from 'initial' onwards get grp >= 1; the 'final' row is forced into group 1
df['grp'] = df['periods'].isin(['initial','final'])
df['grp'] = np.where(df['periods'] == 'final', 1, df.groupby('id')['grp'].cumsum())
# sum 'min' only inside group 1, then spread that value across the whole id
df['sum'] = np.where(df['grp'].eq(1), df.groupby(['id', 'grp'])['min'].transform('sum'), np.nan)
df['sum'] = df.groupby('id')['sum'].transform('max')
df
Out[1]:
id periods min grp sum
0 1 period_x 10 0 14.0
1 1 initial 2 1 14.0
2 1 progress 3 1 14.0
3 1 progress_1 4 1 14.0
4 1 final 5 1 14.0
5 2 period_y 10 0 26.0
6 2 period_z 2 0 26.0
7 2 initial 3 1 26.0
8 2 progress_1 20 1 26.0
9 2 final 3 1 26.0
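An alternative sketch using boolean flags instead of numeric group labels (my own variant, assuming the same columns, with the same result):
# True from the 'initial' row onwards within each id
started = df['periods'].eq('initial').groupby(df['id']).cumsum().gt(0)
# True from the 'final' row onwards within each id
ended = df['periods'].eq('final').groupby(df['id']).cumsum().gt(0)
# rows inside the initial..final window, keeping the 'final' row itself
in_window = started & (~ended | df['periods'].eq('final'))
df['sum'] = df['min'].where(in_window).groupby(df['id']).transform('sum')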
In my Days column, if the number is 5 or more I want to group those rows together and sum the files column.
Input:
Days files
1 10
3 20
4 30
5 40
6 50
Required output:
Days files
1 10
3 20
4 30
5+ 90
You can try Series.clip to cap the upper bound of the series, then do a groupby and sum:
out = df['files'].groupby(df['Days'].clip(upper=5)).sum().reset_index()
print(out)
Days files
0 1 10
1 3 20
2 4 30
3 5 90
If you really want to change anything 5 or above into a str type, you can just replace the 5 using the above logic:
out = df['files'].groupby(df['Days'].clip(upper=5).replace(5,'5+')).sum().reset_index()
print(out)
Days files
0 1 10
1 3 20
2 4 30
3 5+ 90
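A further sketch (my own variant, not part of this answer) that builds the '5+' label directly with numpy instead of replacing after the clip:
import numpy as np
# label rows '5+' when Days >= 5, otherwise keep the day as a string
key = np.where(df['Days'] >= 5, '5+', df['Days'].astype(str))
out = df.groupby(key)['files'].sum().rename_axis('Days').reset_index()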
I'd do it like this (with pd.concat, since DataFrame.append was removed in pandas 2.0):
above = df.files[df.Days >= 5].sum()
pd.concat([df[df.Days < 5], pd.DataFrame([{'Days': '5+', 'files': above}])], ignore_index=True)
It gives:
Days files
0 1 10
1 3 20
2 4 30
3 5+ 90
I am trying to get the amount spent on each TYPE_ID based on the month column.
Dataset :
ID TYPE_ID Month_year Amount
100 1 jun_2019 20
100 1 jul_2019 30
100 2 jun_2019 10
200 1 jun_2019 50
200 1 jun_2019 30
100 2 jul_2019 20
200 2 jun_2019 40
200 2 jul_2019 10
200 2 jun_2019 20
200 1 jul_2019 30
100 1 jul_2019 10
Output :
Based on every TYPE_ID, I want to calculate the spend depending on the month. The column TYPEID_1_jun2019 gives the number of transactions done in that particular month; Amount_type1_jun2019 gives the total amount spent in each month for that TYPE_ID.
ID TYPEID_1_jun2019 Amount_type1_jun2019 TYPEID_1_jul2019 Amount_type1_jul2019 TYPEID_2_jun2019 Amount_type2_jun2019 TYPEID_2_jul2019 Amount_type2_jul2019
100 1 20 2 40 1 10 1 20
200 1 80 1 30 2 60 1 10
EDIT: I also want to calculate the average monthly spend for every ID.
Output: also include these columns,
ID Average_type1_jul2019 Average_type1_jun2019
100 20 10
The formula I used to calculate the average is the amount spent in July with TYPE_ID 1 divided by the total number of months.
First convert Month_year to datetimes for correct ordering, then create a helper column type and aggregate sum together with size, reshape by DataFrame.unstack, sort by DataFrame.sort_index, and last flatten the MultiIndex, converting the datetimes back to the original format:
df['Month_year'] = pd.to_datetime(df['Month_year'], format='%b_%Y')
df1 = (df.assign(type=df['TYPE_ID']).groupby(['ID','Month_year','TYPE_ID'])
.agg({'Amount':'sum', 'type':'size'})
.unstack([1,2])
.sort_index(axis=1, level=[1,2]))
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[2]}_{x[1].strftime("%b_%Y")}')
df1 = df1.reset_index()
print (df1)
ID Amount_1_Jun_2019 type_1_Jun_2019 Amount_2_Jun_2019 \
0 100 20 1 10
1 200 80 2 60
type_2_Jun_2019 Amount_1_Jul_2019 type_1_Jul_2019 Amount_2_Jul_2019 \
0 1 40 2 20
1 2 30 1 10
type_2_Jul_2019
0 1
1 1
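As an aside, the sum-plus-size step can also be written with pivot_table (a sketch, not from the original answer; the same column flattening would follow):
wide = df.pivot_table(index='ID',
                      columns=['Month_year', 'TYPE_ID'],
                      values='Amount',
                      aggfunc=['sum', 'count'])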
EDIT:
# removed the sorting and MultiIndex flattening
df['Month_year'] = pd.to_datetime(df['Month_year'], format='%b_%Y')
df1 = (df.assign(type=df['TYPE_ID']).groupby(['ID','Month_year','TYPE_ID'])
.agg({'Amount':'sum', 'type':'size'})
.unstack([1,2]))
print (df1)
Amount type
Month_year 2019-06-01 2019-07-01 2019-06-01 2019-07-01
TYPE_ID 1 2 1 2 1 2 1 2
ID
100 20 10 40 20 1 1 2 1
200 80 60 30 10 2 2 1 1
# get the number of unique Month_year values per ID and TYPE_ID, and divide Amount by it
df2 = df.groupby(['ID','TYPE_ID'])['Month_year'].nunique().unstack()
df3 = df1.xs('Amount', axis=1, level=0).div(df2, level=1)
# add a top level 'Average'
df3.columns = pd.MultiIndex.from_tuples([('Average', a, b) for a, b in df3.columns])
print (df3)
Average
2019-06-01 2019-07-01
1 2 1 2
ID
100 10.0 5.0 20.0 10.0
200 40.0 30.0 15.0 5.0
# join together, sort and flatten the MultiIndex
df5 = pd.concat([df1, df3],axis=1).sort_index(axis=1, level=[1,2])
df5.columns = df5.columns.map(lambda x: f'{x[0]}_{x[2]}_{x[1].strftime("%b_%Y")}')
df5 = df5.reset_index()
print (df5)
ID Amount_1_Jun_2019 Average_1_Jun_2019 type_1_Jun_2019 \
0 100 20 10.0 1
1 200 80 40.0 2
Amount_2_Jun_2019 Average_2_Jun_2019 type_2_Jun_2019 Amount_1_Jul_2019 \
0 10 5.0 1 40
1 60 30.0 2 30
Average_1_Jul_2019 type_1_Jul_2019 Amount_2_Jul_2019 Average_2_Jul_2019 \
0 20.0 2 20 10.0
1 15.0 1 10 5.0
type_2_Jul_2019
0 1
1 1
I have a pretty similar question to another question on here.
Let's assume I have two dataframes:
df
volume
11
24
30
df2
range_low range_high price
10 20 1
21 30 2
How can I filter the second dataframe based on one value from the first dataframe, keeping only the rows whose range contains that value?
So, for example, value 11 from df leads to:
df3
range_low range_high price
10 20 1
whereas value 30 from df leads to an empty frame:
df3
I am looking for a way to check whether a specific value is in a range of another dataframe, and to filter that dataframe based on this condition. In non-Python pseudocode:
Find 11 in
(10, 20), if True: df3 = filter on this row
(21, 30), if True: df3 = filter on this row
if not
return empty frame
For a loop solution use:
for v in df['volume']:
    df3 = df2[(df2['range_low'] < v) & (df2['range_high'] > v)]
    print (df3)
For a non-loop solution it is possible to use a cross join, but with large DataFrames there may be memory problems:
df = df.assign(a=1).merge(df2.assign(a=1), on='a', how='outer')
print (df)
   volume  a  range_low  range_high  price
0      11  1         10          20      1
1      11  1         21          30      2
2      24  1         10          20      1
3      24  1         21          30      2
4      30  1         10          20      1
5      30  1         21          30      2
df3 = df[(df['range_low'] < df['volume']) & (df['range_high'] > df['volume'])]
print (df3)
   volume  a  range_low  range_high  price
0      11  1         10          20      1
3      24  1         21          30      2
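On pandas >= 1.2 the dummy a column can be dropped in favor of an explicit cross merge; a minimal sketch of the same idea:
merged = df.merge(df2, how='cross')
df3 = merged[(merged['range_low'] < merged['volume']) & (merged['range_high'] > merged['volume'])]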
I have a similar problem (but with date ranges), and if df2 is too large, it takes forever.
If the volumes are always integers, a faster solution is to create an intermediate dataframe where you associate each possible volume with a price (in one iteration) and then merge:
price_list = []
for index, row in df2.iterrows():
    # one row per possible volume in this range, all at this range's price
    x = pd.DataFrame(range(row['range_low'], row['range_high'] + 1), columns=['volume'])
    x['price'] = row['price']
    price_list.append(x)
df_prices = pd.concat(price_list)
You will get something like this:
volume price
0 10 1
1 11 1
2 12 1
3 13 1
4 14 1
5 15 1
6 16 1
7 17 1
8 18 1
9 19 1
10 20 1
0 21 2
1 22 2
2 23 2
3 24 2
4 25 2
5 26 2
6 27 2
7 28 2
8 29 2
9 30 2
Then you can quickly associate a price to each volume in df:
df.merge(df_prices, on='volume')
volume price
0 11 1
1 24 2
2 30 2
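If even this exploded lookup table gets too large, pd.IntervalIndex can match each volume against the ranges directly; a sketch assuming the ranges do not overlap (closed='both' mirrors this answer's inclusive ranges):
import numpy as np
import pandas as pd

# one interval per df2 row, endpoints included
ii = pd.IntervalIndex.from_arrays(df2['range_low'], df2['range_high'], closed='both')
idx = ii.get_indexer(df['volume'])  # -1 where no interval contains the value
df['price'] = np.where(idx >= 0, df2['price'].to_numpy()[idx], np.nan)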
I'm trying to calculate the difference between two columns sequentially as efficiently as possible. My DataFrame looks like this:
category sales initial_stock
1 2 20
1 6 20
1 1 20
2 4 30
2 6 30
2 5 30
2 7 30
And I want to calculate a variable final_stock, like this:
category sales initial_stock final_stock
1 2 20 18
1 6 20 12
1 1 20 11
2 4 30 26
2 6 30 20
2 5 30 15
2 7 30 8
Thus, final_stock first equals initial_stock - sales, and then it equals final_stock.shift() - sales, within each category. I managed to do this with for loops, but it is quite slow, and my feeling says there's probably a one- or two-liner solution to this problem. Do you have any ideas?
Thanks
Use groupby and cumsum on "sales" to get the cumulative stock sold per category, then subtract from "initial_stock":
df['final_stock'] = df['initial_stock'] - df.groupby('category')['sales'].cumsum()
df
category sales initial_stock final_stock
0 1 2 20 18
1 1 6 20 12
2 1 1 20 11
3 2 4 30 26
4 2 6 30 20
5 2 5 30 15
6 2 7 30 8
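For comparison, a sketch of the explicit loop the question describes (my reconstruction, not the asker's code), which the one-liner above replaces:
import pandas as pd

def final_stock_loop(group):
    # walk one category's rows, carrying the running stock
    stock = group['initial_stock'].iloc[0]
    out = []
    for s in group['sales']:
        stock -= s
        out.append(stock)
    return pd.Series(out, index=group.index)

df['final_stock'] = df.groupby('category', group_keys=False).apply(final_stock_loop)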